Project 3 - Classification and Regression -- 2013/2014 CitiBike-NYC Data

Michael Smith, Alex Frye, Chris Boomhower ----- 4/05/2017

Image courtesy of http://newyorkeronthetown.com/, 2017

Introduction & Business Understanding

Describe the purpose of the model you are about to build

The data set selected by our group for Lab 3 again consists of Citi Bike trip history data collected and released by NYC Bike Share, LLC and Jersey Bike Share, LLC under Citi Bike's NYCBS Data Use Policy. Citi Bike is America's largest bike share program, with 10,000 bikes and 600 stations across Manhattan, Brooklyn, Queens, and Jersey City... 55 neighborhoods in all. As such, our data set's trip history includes all rental transactions conducted within the NYC Citi Bike system from July 1st, 2013 to February 28th, 2014, amounting to 5,562,293 trips. The original data set includes 15 attributes. In addition to these 15, our team derived 10 more attributes for use in our classification efforts, several of which are NYC weather measurements from the Carbon Dioxide Information Analysis Center (CDIAC). These weather data are merged with the Citi Bike data to provide environmental insights into rider behavior.

The trip data was collected via Citi Bike's check-in/check-out system among 330 of its stations in the NYC system as part of its transaction history log. While the non-publicized data likely includes further particulars such as rider payment details, the publicized data is anonymized to protect rider identity while simultaneously offering bike share transportation insights to urban developers, engineers, academics, statisticians, and other interested parties. The CDIAC data, however, was collected by the Department of Energy's Oak Ridge National Laboratory for research into global climate change. While basic weather conditions are recorded by CDIAC, as included in our fully merged data set, the organization also measures atmospheric carbon dioxide and other radiatively active gas levels to conduct their research efforts.

Our team has taken particular interest in this data set because some of our members enjoy both recreational and commute cycling. By combining basic weather data with Citi Bike's trip data, our intent in this lab is to: 1) fit clusters describing both Customer and Subscriber user types; 2) build a classification model for both Customer and Subscriber clusters; 3) predict Customer clusters for Subscriber observations and vice versa, producing a classification-ready data set for predicting whether riders are likely to be (or become) Citi Bike subscribers based on environmental conditions during the ride, the day of the week of the trip, trip start and end locations, the general time of day (i.e. morning, midday, afternoon, evening, night), rider age and gender, Customer/Subscriber cluster features, etc.; and 4) demonstrate how we would implement/deploy these clusters against a new data entry for use in a real-time classification model. Due to the exhaustive number of observations in the original data set (5,562,293), a sample of 500,000 is selected to achieve these goals (as described further in the sections below).

Data Understanding 1

Describe the meaning and type of data

Before diving into each attribute in detail, one glaring facet of this data set that needs mentioning is its inherent time-series nature. By no means was this overlooked when we decided upon these particular data. To mitigate the effects of time on our analysis results, we have chosen to aggregate time-centric attributes such as dates and hours of the day, replacing them with simply the day of the week or the period of the day (more on these details shortly). For example, by identifying trips occurring on July 1st, 2013, not by the date of occurrence but rather the day of the week, Monday, and identifying trips on July 2nd, 2013, as occurring on Tuesday, we obtain a "big picture" understanding of trends by day of the week instead of at the date-by-date level. We understand this is not a perfect solution, since the time-series component remains an underlying factor in trip activity, but it is good enough to answer the types of questions we hope to target, as described in the previous section, because we will be comparing all Mondays against all Tuesdays, and so on.
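The date-to-weekday mapping described above can be sketched with Python's standard library (an illustration only, not the notebook's actual pipeline code):

```python
from datetime import datetime

def day_of_week(dt):
    """Map a timestamp to its weekday name (Python uses Monday=0 ... Sunday=6)."""
    names = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
    return names[dt.weekday()]

# July 1st, 2013 fell on a Monday; July 2nd, 2013 on a Tuesday
print(day_of_week(datetime(2013, 7, 1)))  # Monday
print(day_of_week(datetime(2013, 7, 2)))  # Tuesday
```

Aggregating on this derived label (rather than the raw date) is what lets us pool, say, every Monday in the data set into one group.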

As mentioned previously, the original data set from Citi Bike included 15 attributes. These 15 attributes and associated descriptions are provided below:

  1. tripduration - Integer - The total time (in seconds) a bike remains checked out, beginning with the start time and ending with the stop time
  2. starttime - DateTime - The date and time at which a bike was checked out, marking the start of a trip (i.e. 2/12/2014 8:16)
  3. stoptime - DateTime - The date and time at which a bike was checked back in, marking the end of a trip (i.e. 2/12/2014 8:16)
  4. start_station_id - String - A categorical number value used to identify Citi Bike stations, in this case the station from which a bike is checked out
  5. start_station_name - String - The name of the station from which a bike is checked out; most often the name of an intersection (i.e. E 39 St & 2 Ave)
  6. start_station_latitude - Float - The latitude coordinate for the station from which a bike is checked out (i.e. 40.74780373)
  7. start_station_longitude - Float - The longitude coordinate for the station from which a bike is checked out (i.e. -73.9900262)
  8. end_station_id - String - A categorical number value used to identify Citi Bike stations, in this case the station in which a bike is checked in
  9. end_station_name - String - The name of the station at which a bike is checked in; most often the name of an intersection (i.e. E 39 St & 2 Ave)
  10. end_station_latitude - Float - The latitude coordinate for the station at which a bike is checked in (i.e. 40.74780373)
  11. end_station_longitude - Float - The longitude coordinate for the station at which a bike is checked in (i.e. -73.9900262)
  12. bikeid - String - A categorical number value used to identify a particular bike; each bike in the bike share network has its own unique number
  13. usertype - String - A classifier attribute identifying a rider as a bike share subscriber or a one-time customer (i.e. Subscriber vs. Customer)
  14. birth_year - Integer - The year a rider was born (Only available for subscribed riders, however)
  15. gender - String - A categorical number value representing a rider's gender (i.e. 0 = unknown, 1 = male, 2 = female)

It is important to note that birth year and gender details are not available for "Customer" user types but rather for "Subscriber" riders only. Fortunately, these are the only missing data values among all trips in the data set. Unfortunately, it means that we will not be able to identify the ratio of males to females among non-subscribers or use age to predict Subscribers vs. non-subscribers (Customers). More to this end will be discussed in the next section.

It is also worth mentioning that while attributes such as trip duration, start and end stations, bike ID, and basic rider details were collected and shared with the general public, care was taken by Citi Bike to remove trips taken by staff during system service appointments and inspections, trips to or from "test" stations which were employed during the data set's timeframe, and trips lasting less than 60 seconds which could indicate false checkout or re-docking efforts during checkin.
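Citi Bike performed this cleaning before releasing the data, so our pipeline does not repeat it; for illustration only, the sub-60-second rule reduces to a one-line pandas filter (sketched here on a hypothetical toy frame):

```python
import pandas as pd

# Toy trip durations in seconds; the 45s and 59s rows mimic the likely
# false-checkout / re-docking events Citi Bike strips before publishing
trips = pd.DataFrame({"tripduration": [45, 308, 59, 1355]})

# Keep only trips lasting at least 60 seconds
trips = trips[trips["tripduration"] >= 60]
print(len(trips))  # 2
```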

Because some attributes may be deemed duplicates (i.e. start_station_id, start_station_name, and start_station_latitude/longitude all identify station locations), we chose to extract further attributes from the base attributes at hand. Further attributes were also extracted to mitigate the effects of time. In addition, we felt increased understanding could be obtained by merging weather data into the trip records, as discussed in the previous section. These 10 additional attributes are described below:

  1. LinearDistance - Float - The distance (miles) from a start station to an end station (as the crow flies); calculated from the latitude/longitude coordinates of the start/end stations
  2. DayOfWeek - String - The day of the week a trip occurs regardless of time of day, month, etc.; extracted from the starttime attribute (i.e. Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday)
  3. TimeOfDay - String - The portion of the day during which a bike was checked out; extracted from the starttime attribute (i.e. Morning, Midday, Afternoon, Evening, Night)
  4. HolidayFlag - String - A categorical binary value used to identify whether the day a trip occurred was on a holiday or not; extracted from the starttime attribute (i.e. 0 = Non-Holiday, 1 = Holiday)
  5. Age - Integer - The age of a rider at the time of a trip; calculated based on the birth_year attribute (Since only birth year is included in original Citi Bike data set, exact age at time of trip when considering birth month is not possible)
  6. PRCP - Float - The total recorded rainfall in inches on the day of a trip; merged from the CDIAC weather data set
  7. SNOW - Float - The total recorded snowfall in inches on the day of a trip; merged from the CDIAC weather data set
  8. TAVE - Integer - The average temperature throughout the day on which a trip occurs; merged from the CDIAC weather data set
  9. TMAX - Integer - The maximum temperature on the day on which a trip occurs; merged from the CDIAC weather data set
  10. TMIN - Integer - The minimum temperature on the day on which a trip occurs; merged from the CDIAC weather data set
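Our pipeline computes LinearDistance with geopy's vincenty function (shown in the load code below); for illustration, the same as-the-crow-flies idea can be sketched self-contained with the haversine formula. Note that haversine assumes a spherical earth, so its result differs slightly from vincenty's ellipsoidal model; the station pair below is taken from the first trip displayed later in this notebook.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * asin(sqrt(a)) * 3958.8  # mean earth radius in miles

# S Portland Ave & Hanson Pl -> Fulton St & Grand Ave
d = haversine_miles(40.685396, -73.974315, 40.682232, -73.961458)
print(round(d, 2))  # ~0.71, close to the 0.709731 vincenty value
```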

After extracting our own attributes and merging weather data, the total number of attributes present in our final data set is 25. Only 15 are used throughout this lab, however, due to the duplicate nature of some attributes as discussed already. The final list of attributes used is tripduration, DayOfWeek, TimeOfDay, HolidayFlag, start_station_name, start_station_latitude, start_station_longitude, usertype, gender, Age, PRCP, SNOW, TAVE, TMAX, and TMIN.

Load the Data

Compiling Multiple Data Sources

To begin our analysis, we need to load the data from our source .csv files. Steps taken to pull data from the various source files are as follows:

  • For each file from CitiBike, we process each line, appending the manually computed columns [LinearDistance, DayOfWeek, TimeOfDay, & HolidayFlag].
  • Similarly, we load our weather data .csv file.
  • With both source file variables gathered, we append the weather data to our CitiBike data by matching on the date.
  • To avoid a 2-hour run-time on every execution, we write the final version of the data to .csv files. Each file consists of 250,000 records to reduce file size for GitHub loads.
  • All of the above logic is skipped if the file "Compiled Data/dataset1.csv" already exists.

Below you will see this process, as well as import/options for needed python modules throughout this analysis.

In [1]:
import os
from geopy.distance import vincenty
import holidays
from datetime import datetime
from dateutil.parser import parse
import glob
import pandas as pd
import numpy as np
from IPython.display import display
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import gmaps
from sklearn.linear_model import LogisticRegression
from sklearn import metrics as mt
from sklearn.cross_validation import ShuffleSplit
from sklearn.metrics import log_loss
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
from sklearn.cross_validation import cross_val_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from itertools import cycle
from sklearn.metrics import roc_curve, auc
from scipy import interp
from sklearn.metrics import confusion_matrix
import itertools
from sklearn.ensemble  import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor
import statsmodels.stats.api as sms

from sklearn.cluster import DBSCAN
from sklearn.cluster import SpectralClustering
from sklearn.neighbors import kneighbors_graph
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
import matplotlib.cm as cmx

%load_ext memory_profiler

plt.rcParams['figure.figsize'] = (12, 6)

pd.options.mode.chained_assignment = None

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

import pickle

def pickleObject(objectname, filename, filepath = "PickleFiles/"):
    fullpicklepath = "{0}{1}.pkl".format(filepath, filename)
    # Open the pickle file in binary write mode and dump the object
    with open(fullpicklepath, 'wb') as picklefile:
        pickle.dump(objectname, picklefile)

def unpickleObject(filename, filepath = "PickleFiles/"):
    fullunpicklepath = "{0}{1}.pkl".format(filepath, filename)
    # Open the pickle file in binary read mode and load the object
    with open(fullunpicklepath, 'rb') as unpicklefile:
        unpickled = pickle.load(unpicklefile)
    return unpickled

def clear_display():
    from IPython.display import clear_output
    clear_output()
C:\Users\Owner\Anaconda2\envs\Py3\lib\site-packages\sklearn\cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
In [2]:
############################################################
# Load & Merge Data from Source Files
# Parse into Compiled Files
############################################################

starttime = datetime.now()
print('Starting Source Data Load & Merge Process. \n'
      'Start Time: ' + str(starttime))

if os.path.isfile("Compiled Data/dataset1.csv"):
    print("Found the File!")
else:
    citiBikeDataDirectory = "Citi Bike Data"
    citiBikeDataFileNames = [
        "2013-07 - Citi Bike trip data - 1.csv",
        "2013-07 - Citi Bike trip data - 2.csv",
        "2013-08 - Citi Bike trip data - 1.csv",
        "2013-08 - Citi Bike trip data - 2.csv",
        "2013-09 - Citi Bike trip data - 1.csv",
        "2013-09 - Citi Bike trip data - 2.csv",
        "2013-10 - Citi Bike trip data - 1.csv",
        "2013-10 - Citi Bike trip data - 2.csv",
        "2013-11 - Citi Bike trip data - 1.csv",
        "2013-11 - Citi Bike trip data - 2.csv",
        "2013-12 - Citi Bike trip data.csv",
        "2014-01 - Citi Bike trip data.csv",
        "2014-02 - Citi Bike trip data.csv"
    ]

    weatherDataFile = "Weather Data/NY305801_9255_edited.txt"

    citiBikeDataRaw = []

    for file in citiBikeDataFileNames:
        print(file)
        filepath = citiBikeDataDirectory + "/" + file
        with open(filepath) as f:
            lines = f.read().splitlines()
            lines.pop(0)  # get rid of the first line that contains the column names
            for line in lines:
                line = line.replace('"', '')
                line = line.split(",")
                sLatLong = (line[5], line[6])
                eLatLong = (line[9], line[10])

                distance = vincenty(sLatLong, eLatLong).miles
                line.extend([distance])

                ## Parse the trip's start time once and reuse it for every
                ## derived attribute below
                start_dt = parse(line[1])

                ## Monday = 0 ... Sunday = 6
                dayNames = ["Monday", "Tuesday", "Wednesday", "Thursday",
                            "Friday", "Saturday", "Sunday"]
                DayOfWeek = dayNames[start_dt.weekday()]
                line.extend([DayOfWeek])

                ## Morning      5AM-10AM
                ## Midday       10AM-2PM
                ## Afternoon    2PM-5PM
                ## Evening      5PM-10PM
                ## Night        10PM-5AM
                if 5 <= start_dt.hour < 10:
                    TimeOfDay = 'Morning'
                elif 10 <= start_dt.hour < 14:
                    TimeOfDay = 'Midday'
                elif 14 <= start_dt.hour < 17:
                    TimeOfDay = 'Afternoon'
                elif 17 <= start_dt.hour < 22:
                    TimeOfDay = 'Evening'
                else:
                    TimeOfDay = 'Night'
                line.extend([TimeOfDay])

                ## HolidayFlag: 1 = Yes, 0 = No
                if start_dt in holidays.UnitedStates():
                    holidayFlag = "1"
                else:
                    holidayFlag = "0"
                line.extend([holidayFlag])

                citiBikeDataRaw.append(line)
            del lines

    with open(weatherDataFile) as f:
        weatherDataRaw = f.read().splitlines()
        weatherDataRaw.pop(0)  # again, get rid of the column names
        for c in range(len(weatherDataRaw)):
            weatherDataRaw[c] = weatherDataRaw[c].split(",")
            # Adjust days and months to have a leading zero so we can capture all the data
            if len(weatherDataRaw[c][2]) < 2:
                weatherDataRaw[c][2] = "0" + weatherDataRaw[c][2]
            if len(weatherDataRaw[c][0]) < 2:
                weatherDataRaw[c][0] = "0" + weatherDataRaw[c][0]

    # Index weather records by (year, month, day) so each trip requires a
    # single dictionary lookup instead of a scan over all weather records
    weatherByDate = {(record[4], record[2], record[0]): record
                     for record in weatherDataRaw}

    citiBikeData = []

    while citiBikeDataRaw:
        instance = citiBikeDataRaw.pop()
        date = instance[1].split(" ")[0].split("-")  # the trip's start date
        record = weatherByDate.get((date[0], date[1], date[2]))
        if record is not None:
            instance.extend([record[5], record[6], record[7], record[8], record[9]])
            citiBikeData.append(instance)

    del citiBikeDataRaw
    del weatherDataRaw

    # Final Columns:
    #  0 tripduration
    #  1 starttime
    #  2 stoptime
    #  3 start station id
    #  4 start station name
    #  5 start station latitude
    #  6 start station longitude
    #  7 end station id
    #  8 end station name
    #  9 end station latitude
    # 10 end station longitude
    # 11 bikeid
    # 12 usertype
    # 13 birth year
    # 14 gender
    # 15 start/end station distance
    # 16 DayOfWeek
    # 17 TimeOfDay
    # 18 HolidayFlag
    # 19 PRCP
    # 20 SNOW
    # 21 TAVE
    # 22 TMAX
    # 23 TMIN

    maxLineCount = 250000
    lineCounter = 1
    fileCounter = 1
    outputDirectoryFilename = "Compiled Data/dataset"
    f = open(outputDirectoryFilename + str(fileCounter) + ".csv", "w")
    for line in citiBikeData:
        if lineCounter == maxLineCount:
            print(f.name)
            f.close()
            lineCounter = 1
            fileCounter = fileCounter + 1
            f = open(outputDirectoryFilename + str(fileCounter) + ".csv", "w")
        f.write(",".join(map(str, line)) + "\n")
        lineCounter = lineCounter + 1
    f.close()

    del citiBikeData

endtime = datetime.now()
print('Ending Source Data Load & Merge Process. \n'
      'End Time: ' + str(endtime) + '\n'
      'Total RunTime: ' + str(endtime - starttime))
Starting Source Data Load & Merge Process. 
Start Time: 2017-04-16 21:54:03.923234
Found the File!
Ending Source Data Load & Merge Process. 
End Time: 2017-04-16 21:54:03.923234
Total RunTime: 0:00:00.001976
Loading the Compiled Data from CSV

Now that we have compiled data files from both CitiBike and the weather data, we want to load them into a Pandas dataframe for analysis. We iterate over and load each file produced above, then assign each column its appropriate data type. Additionally, we compute the Age column after assigning a default value for missing "Birth Year" values. This is discussed further in the Data Preparation 1 section.

In [3]:
%%time
############################################################
# Load the Compiled Data from CSV
############################################################

# Create CSV Reader Function and assign column headers
def reader(f, columns):
    d = pd.read_csv(f)
    d.columns = columns
    return d


# Identify All CSV FileNames needing to be loaded
path = r'Compiled Data'
all_files = glob.glob(os.path.join(path, "*.csv"))

# Define File Columns
columns = ["tripduration", "starttime", "stoptime", "start_station_id", "start_station_name",
           "start_station_latitude",
           "start_station_longitude", "end_station_id", "end_station_name", "end_station_latitude",
           "end_station_longitude", "bikeid", "usertype", "birth year", "gender", "LinearDistance", "DayOfWeek",
           "TimeOfDay", "HolidayFlag", "PRCP", "SNOW", "TAVE", "TMAX", "TMIN"]

# Load Data
CitiBikeDataCompiled = pd.concat([reader(f, columns) for f in all_files])

# Replace '\N' Birth Years with Zero Values
CitiBikeDataCompiled["birth year"] = CitiBikeDataCompiled["birth year"].replace(r'\N', '0')

# Convert Columns to Numerical Values
CitiBikeDataCompiled[['tripduration', 'birth year', 'LinearDistance', 'PRCP', 'SNOW', 'TAVE', 'TMAX', 'TMIN']] \
    = CitiBikeDataCompiled[['tripduration', 'birth year', 'LinearDistance', 'PRCP', 'SNOW', 'TAVE', 'TMAX',
                            'TMIN']].apply(pd.to_numeric)

# Convert Columns to Date Values
CitiBikeDataCompiled[['starttime', 'stoptime']] \
    = CitiBikeDataCompiled[['starttime', 'stoptime']].apply(pd.to_datetime)

# Compute Age: 0 Birth Year = 0 Age ELSE Compute Start Time Year Minus Birth Year
CitiBikeDataCompiled["Age"] = np.where(CitiBikeDataCompiled["birth year"] == 0, 0,
                                       CitiBikeDataCompiled["starttime"].dt.year - CitiBikeDataCompiled[
                                           "birth year"])

# Convert Columns to Str Values
CitiBikeDataCompiled[['start_station_id', 'end_station_id', 'bikeid', 'HolidayFlag', 'gender']] \
    = CitiBikeDataCompiled[['start_station_id', 'end_station_id', 'bikeid', 'HolidayFlag', 'gender']].astype(str)
Wall time: 1min 45s
In [4]:
%%time
print(len(CitiBikeDataCompiled))
display(CitiBikeDataCompiled.head())
5562293
tripduration starttime stoptime start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bikeid usertype birth year gender LinearDistance DayOfWeek TimeOfDay HolidayFlag PRCP SNOW TAVE TMAX TMIN Age
0 308 2014-02-28 23:59:10 2014-03-01 00:04:18 353 S Portland Ave & Hanson Pl 40.685396 -73.974315 365 Fulton St & Grand Ave 40.682232 -73.961458 14761 Subscriber 1982 1 0.709731 Friday Night 0 0.0 0.0 17 24 9 32
1 304 2014-02-28 23:58:17 2014-03-01 00:03:21 497 E 17 St & Broadway 40.737050 -73.990093 334 W 20 St & 7 Ave 40.742388 -73.997262 17112 Subscriber 1968 1 0.526555 Friday Night 0 0.0 0.0 17 24 9 46
2 1355 2014-02-28 23:57:55 2014-03-01 00:20:30 470 W 20 St & 8 Ave 40.743453 -74.000040 302 Avenue D & E 3 St 40.720828 -73.977932 15608 Subscriber 1985 2 1.945255 Friday Night 0 0.0 0.0 17 24 9 29
3 848 2014-02-28 23:57:13 2014-03-01 00:11:21 498 Broadway & W 32 St 40.748549 -73.988084 432 E 7 St & Avenue A 40.726218 -73.983799 17413 Subscriber 1976 1 1.557209 Friday Night 0 0.0 0.0 17 24 9 38
4 175 2014-02-28 23:57:12 2014-03-01 00:00:07 383 Greenwich Ave & Charles St 40.735238 -74.000271 284 Greenwich Ave & 8 Ave 40.739017 -74.002638 15220 Subscriber 1956 1 0.288829 Friday Night 0 0.0 0.0 17 24 9 58
Wall time: 550 ms

Variable Definition and Preparation

Measurable Data Quality Factors

When analyzing our final dataset for accurate measures, there are a few key factors we can easily verify/research:

  • Computational Accuracy: Ensure data attributes added by computation are correct

    • TimeOfDay
    • DayOfWeek
    • HolidayFlag
  • Missing Data from Source

  • Duplicate Data from Source
  • Outlier Detection
  • Sampling to 500,000 Records for further analysis

Immeasurable Data Quality Factors

Although we are able to research the many factors above, one computation may still lack information in this dataset. Our LinearDistance attribute computes the distance from one lat/long coordinate to another. This attribute does not, however, tell us the 'true' distance a biker traveled before returning the bike. Some bikers may ride for exercise around the city with various turns and loops, whereas others travel the quickest path to their destination. Because our dataset limits us to start and end locations, we do not have enough information to accurately compute distance traveled. For this reason, we named the attribute "LinearDistance" rather than "DistanceTraveled".

Below we will walk through the process of researching the 'Measurable' data quality factors mentioned above.

Computational Accuracy: TimeOfDay

To help mitigate challenges with time series data, we have chosen to break TimeOfDay into 5 categories. These categories are broken down below:

  • Morning 5 AM - 10 AM
  • Midday 10 AM - 2 PM
  • Afternoon 2 PM - 5 PM
  • Evening 5 PM - 10 PM
  • Night 10 PM - 5 AM

To ensure that these breakdowns are accurately computed, we pulled the distinct list of TimeOfDay assignments by starttime hour. Looking at the results below, we can verify that this categorization is correctly being assigned.

In [5]:
%%time
    # Compute StartHour from StartTime
CitiBikeDataCompiled["StartHour"] = CitiBikeDataCompiled["starttime"].dt.hour

    # Compute Distinct Combinations of StartHour and TimeOfDay
DistinctTimeOfDayByHour = CitiBikeDataCompiled[["StartHour", "TimeOfDay"]].drop_duplicates().sort_values("StartHour")

    # Print
display(DistinctTimeOfDayByHour)

    #Clean up Variables
del CitiBikeDataCompiled["StartHour"]
StartHour TimeOfDay
9517 0 Night
9482 1 Night
9470 2 Night
9457 3 Night
9437 4 Night
9362 5 Morning
9147 6 Morning
8642 7 Morning
7644 8 Morning
6866 9 Morning
6452 10 Midday
6113 11 Midday
5696 12 Midday
5228 13 Midday
4734 14 Afternoon
4199 15 Afternoon
3460 16 Afternoon
2405 17 Evening
1464 18 Evening
851 19 Evening
503 20 Evening
298 21 Evening
128 22 Night
0 23 Night
Wall time: 5.25 s
Computational Accuracy: DayOfWeek

In order to verify our computed DayOfWeek column, we have chosen one full week from 12/22/2013 - 12/28/2013 to validate. Below is a calendar image of this week to baseline our expected results:

To verify these 7 days, we pulled the distinct list of DayOfWeek assignments by StartDate (No Time). If we can verify one full week, we may justify that the computation is correct across the entire dataset. Looking at the results below, we can verify that this categorization is correctly being assigned.

In [6]:
%%time
    # Create DataFrame for StartTime, DayOfWeek within Date Threshold
CitiBikeDayOfWeekTest = CitiBikeDataCompiled[(CitiBikeDataCompiled['starttime'].dt.year == 2013)
                                             & (CitiBikeDataCompiled['starttime'].dt.month == 12)
                                             & (CitiBikeDataCompiled['starttime'].dt.day >= 22)
                                             & (CitiBikeDataCompiled['starttime'].dt.day <= 28)][
    ["starttime", "DayOfWeek"]]

    # Create FloorDate Variable as StartTime without the timestamp
CitiBikeDayOfWeekTest["StartFloorDate"] = CitiBikeDayOfWeekTest["starttime"].dt.strftime('%m/%d/%Y')

    # Compute Distinct combinations
DistinctDayOfWeek = CitiBikeDayOfWeekTest[["StartFloorDate", "DayOfWeek"]].drop_duplicates().sort_values(
    "StartFloorDate")

    #Print
display(DistinctDayOfWeek)

    # Clean up Variables
del CitiBikeDayOfWeekTest
del DistinctDayOfWeek
StartFloorDate DayOfWeek
107323 12/22/2013 Sunday
100367 12/23/2013 Monday
89342 12/24/2013 Tuesday
86082 12/25/2013 Wednesday
76319 12/26/2013 Thursday
64599 12/27/2013 Friday
52577 12/28/2013 Saturday
Wall time: 2.72 s
Computational Accuracy: HolidayFlag

Using the same week as was used to verify DayOfWeek, we can test whether HolidayFlag is set correctly for the Christmas holiday. We pulled the distinct list of HolidayFlag assignments by StartDate (no time). If we can verify one holiday, we may justify that the computation is correct across the entire dataset. Looking at the results below, we expect to see HolidayFlag = 1 only for 12/25/2013.

In [7]:
%%time
    # Create DataFrame for StartTime, HolidayFlag within Date Threshold
CitiBikeHolidayFlagTest = CitiBikeDataCompiled[(CitiBikeDataCompiled['starttime'].dt.year == 2013)
                                             & (CitiBikeDataCompiled['starttime'].dt.month == 12)
                                             & (CitiBikeDataCompiled['starttime'].dt.day >= 22)
                                             & (CitiBikeDataCompiled['starttime'].dt.day <= 28)][
    ["starttime", "HolidayFlag"]]

    # Create FloorDate Variable as StartTime without the timestamp
CitiBikeHolidayFlagTest["StartFloorDate"] = CitiBikeHolidayFlagTest["starttime"].dt.strftime('%m/%d/%Y')

    # Compute Distinct combinations
DistinctHolidayFlag = CitiBikeHolidayFlagTest[["StartFloorDate", "HolidayFlag"]].drop_duplicates().sort_values(
    "StartFloorDate")
    
    #Print
display(DistinctHolidayFlag)
    
    # Clean up Variables
del CitiBikeHolidayFlagTest
del DistinctHolidayFlag
StartFloorDate HolidayFlag
107323 12/22/2013 0
100367 12/23/2013 0
89342 12/24/2013 0
86082 12/25/2013 1
76319 12/26/2013 0
64599 12/27/2013 0
52577 12/28/2013 0
Wall time: 2.18 s
Missing Data from Source

Accounting for missing data is a crucial part of our analysis. At first glance, it is very apparent that we have a large amount of missing data in the Gender and Birth Year attributes from our source CitiBike data. We already handled missing Birth Year values while computing "Age" in the Data Load from CSV section of this paper, assigning a DEFAULT value of (0) so that future computations do not produce NA values. Missing Gender values were likewise already assigned a default value of (0) in the source data. Although we have handled these missing values with a default, we want to ensure that we 'need' these records for further analysis - or whether we may remove them from the dataset. Below you will see a table showing the frequency of missing values (or forced default values) by usertype. We noticed that of the 4,881,384 Subscribing Members in our dataset, only 295 were missing Gender information, whereas out of the 680,909 Customer Users (Non-Subscribing), there was only one observation with complete information for both Gender and Birth Year. This quickly told us that removing records with missing values is NOT an option, since we would lose data for our entire Customer usertype. These attributes, as well as Age (computed from birth year), will prove difficult to use in a classification model attempting to predict usertype.

We have also examined all other attributes and verified that there are no additional missing values in our dataset. A missing value matrix was produced to identify any gaps in our data across all attributes. Because the results were conclusive (no missing values present), we removed this uninformative visualization from the report.
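The check behind that omitted matrix reduces to a per-column null count; a minimal sketch on a toy frame standing in for the full 5.5M-row dataset (PRCP is given one deliberately missing value to show what a gap would look like):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking a few columns of the compiled dataset
df = pd.DataFrame({"tripduration": [308, 304, 1355],
                   "gender": ["1", "1", "2"],
                   "PRCP": [0.0, np.nan, 0.0]})

# Per-column count of missing values; an all-zero result means no gaps
missing_counts = df.isnull().sum()
print(missing_counts)
```

On the real data, every column reports zero (after the Birth Year and Gender defaults described above are in place).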

In [8]:
%%time
NADatatestData = CitiBikeDataCompiled[["usertype","gender", "birth year"]].copy() # .copy() avoids SettingWithCopyWarning on the assignments below

NADatatestData["GenderISNA"] = np.where(CitiBikeDataCompiled["gender"] == '0', 1, 0)
NADatatestData["BirthYearISNA"] = np.where(CitiBikeDataCompiled["birth year"] == 0, 1,0)

NAAggs = pd.DataFrame({'count' : NADatatestData.groupby(["usertype","GenderISNA", "BirthYearISNA"]).size()}).reset_index()

display(NAAggs)

del NAAggs
usertype GenderISNA BirthYearISNA count
0 Customer 0 0 1
1 Customer 0 1 42
2 Customer 1 0 73
3 Customer 1 1 680793
4 Subscriber 0 0 4881089
5 Subscriber 1 0 295
Wall time: 2.3 s
Duplicate Data from Source

To ensure that there are no duplicate records in our dataset, we verified that the record counts before and after dropping potential duplicates were equal. This test passed, so no alterations to the dataset were needed on account of duplicate records.

In [9]:
%%time
len(CitiBikeDataCompiled) == len(CitiBikeDataCompiled.drop_duplicates())
Wall time: 34.1 s
Out[9]:
True
Outlier Detection

Trip Duration

In analyzing a box plot of trip duration values, we find extreme outliers, with durations reaching up to 72 days in the most extreme instance. Our team decided to rule out any observation with a duration greater than 24 hours: beyond that point it becomes far more likely that a rider has, for example, slept overnight with the bike still checked out, which skews the values and could hurt any downstream analysis. We move forward with removing a total of 457 observations whose trip duration exceeds 24 hours (86,400 seconds).

In [10]:
%%time
%matplotlib inline

#CitiBikeDataCompiledBackup = CitiBikeDataCompiled
#CitiBikeDataCompiled = CitiBikeDataCompiledBackup

    # BoxPlot tripDuration - Heavy Outliers!
sns.boxplot(y = "tripduration", data = CitiBikeDataCompiled)
sns.despine()
    
    # How Many Greater than 24 hours?
print(len(CitiBikeDataCompiled[CitiBikeDataCompiled["tripduration"]>86400]))

    # Remove trips longer than 24 hours
CitiBikeDataCompiled = CitiBikeDataCompiled[CitiBikeDataCompiled["tripduration"] <= 86400] # <= keeps trips of exactly 24 hours, matching the criterion above
457
Wall time: 12.8 s
Compiler : 198 ms

Once outliers are removed, we run the box plot again and still see skew in the results. To mitigate this right-skewed distribution (a long tail of high durations), we take a log transform of the attribute.

In [11]:
%%time
%matplotlib inline

    # BoxPlot Trip Duration AFTER removal of outliers
sns.boxplot(y = "tripduration", data = CitiBikeDataCompiled)
sns.despine()

    # Log Transform Column Added
CitiBikeDataCompiled["tripdurationLog"] = CitiBikeDataCompiled["tripduration"].apply(np.log)
Wall time: 1.75 s
In [12]:
%%time
%matplotlib inline

    # BoxPlot TripDurationLog
sns.boxplot(y = "tripdurationLog", data = CitiBikeDataCompiled)
sns.despine()
Wall time: 1.74 s

Age

Similarly, we look at the distribution of Age in our dataset. Interestingly, several outlier observations log a birth year far enough back to compute an age of 115 years. Possible explanations for these outlier ages include data entry errors by riders reluctant to disclose personal information, or account sharing between a parent and child, which renders the data point inaccurate for the person actually taking the trip. Our target demographic for this study is individuals under 65 years of age, as these age groups are more likely to be in the physical condition the bike share service demands. Given this target demographic, and the bad entries causing extreme outliers, we have chosen to limit our dataset to observations up to 65 years of age. This change removed an additional 53,824 records from the dataset.

In [13]:
%%time
%matplotlib inline

    # BoxPlot Age - Outliers!
sns.boxplot(y = "Age", data = CitiBikeDataCompiled[CitiBikeDataCompiled["Age"]!= 0])
sns.despine()
    
    # How Many Greater than 65 years old?
print(len(CitiBikeDataCompiled[CitiBikeDataCompiled["Age"]>65]))

    # Remove > 65 years old
CitiBikeDataCompiled = CitiBikeDataCompiled[CitiBikeDataCompiled["Age"]<=65]
53824
Wall time: 7.99 s
In [14]:
%%time
%matplotlib inline

    # BoxPlot Age - removed Outliers!
sns.boxplot(y = "Age", data = CitiBikeDataCompiled[CitiBikeDataCompiled["Age"]!= 0])
sns.despine()
Wall time: 4.54 s
Record Sampling to 500,000 Records

Given the extremely large volume of data collected, we decided to sample down to roughly one-tenth of the original dataset, for a total of 500,000 records. Before taking this action, however, we wanted to ensure that the sampled proportions remain reasonable for analysis and that we do not lose any important demographic in our data.

Below we compute the percentage of our dataset made up of Customers vs. Subscribers. We note that 87.6% of the data consists of Subscriber users, whereas the remaining 12.4% represents Customers.

In [15]:
%%time
%matplotlib inline
UserTypeDist = pd.DataFrame({'count' : CitiBikeDataCompiled.groupby(["usertype"]).size()}).reset_index()
display(UserTypeDist)

UserTypeDist.plot.pie(y = 'count', labels = ['Customer', 'Subscriber'], autopct='%1.1f%%')
usertype count
0 Customer 680796
1 Subscriber 4827216
Wall time: 1min 5s
Compiler : 189 ms

In our sample dataset for this analysis, we have chosen to sample Customer observations at a rate well above their natural 12.4% share, forcing a 50/50 split between the two classes. This will help reduce bias in the model toward Subscribers that would otherwise arise simply from the class distribution in the sample.

We compute the sample size for each usertype and then take a random sample within each group. Below you will see that our sampled distribution matches the chosen 50/50 split between the Customer and Subscriber usertypes.

In [16]:
%%time
SampleSize = 500000

CustomerSampleSize_Seed   = int(round(SampleSize * 50.0 / 100.0,0))
SubscriberSampleSize_Seed = int(round(SampleSize * 50.0 / 100.0,0))

CitiBikeCustomerDataSampled = CitiBikeDataCompiled[CitiBikeDataCompiled["usertype"] == 'Customer'].sample(n=CustomerSampleSize_Seed, replace = False, random_state = CustomerSampleSize_Seed)
CitiBikeSubscriberDataSampled = CitiBikeDataCompiled[CitiBikeDataCompiled["usertype"] == 'Subscriber'].sample(n=SubscriberSampleSize_Seed, replace = False, random_state = SubscriberSampleSize_Seed)

CitiBikeDataSampled_5050 = pd.concat([CitiBikeCustomerDataSampled,CitiBikeSubscriberDataSampled])

print(len(CitiBikeDataSampled_5050))

UserTypeDist = pd.DataFrame({'count' : CitiBikeDataSampled_5050.groupby(["usertype"]).size()}).reset_index()
display(UserTypeDist)

UserTypeDist.plot.pie(y = 'count', labels = ['Customer', 'Subscriber'], autopct='%1.1f%%')
500000
usertype count
0 Customer 250000
1 Subscriber 250000
Wall time: 7.24 s

Data Understanding 2 - Visualize important attributes

To reiterate, our main objectives in analyzing these data are to determine which attributes have the greatest bearing on predicting a rider's type (Customer vs. Subscriber) and to gain a better understanding of rider behavior as a function of external factors. Many attributes in this data set may eventually be used in subsequent analyses to answer these questions. The primary attributes on which we will focus our attention in this section, however, are as follows:

  • Starting Location
  • Day of the Week
  • Time of Day
  • Trip Duration (both log and non-log)
  • Linear Distance
  • Gender
  • Age

Over the course of this section, we will review these top attributes in some detail and discuss the value of our chosen visualizations. The merged weather data is of significant interest as well.

Geographical Start Stations Heatmap

Before discussing the following heatmap in detail, it is worth noting some special steps required to use the gmaps module in Python, in case the reader is interested in rendering our code to plot data on top of Google's maps. (Full instructions are available at https://media.readthedocs.org/pdf/jupyter-gmaps/latest/jupyter-gmaps.pdf.)

Besides having Jupyter Notebook installed on one's computer with extensions enabled (default if using Anaconda) and installing the gmaps module using pip, the following line should be run from within the command terminal. This is only to be done once and should be done when Jupyter Notebook is not running.

$ jupyter nbextension enable --py gmaps

In addition to running the above line at the command prompt, a Standard Google API key will need to be obtained from https://developers.google.com/maps/documentation/javascript/get-api-key. This only needs to be done once and is necessary to pull the Google map data into the Jupyter Notebook environment. The key is entered in the gmaps.configure() line as shown in the cell below. We have provided our own private key in the meantime for the reader's convenience.

Now on to the data visualization... This geographical heatmap is interactive; however, the kernel must re-run the code block each time our Jupyter Notebook file is opened due to the API key requirement. Therefore, we've captured some interesting views to aid our discussion and have included them as embedded images.

The start station heatmap represents the start station location data via attributes start_station_latitude and start_station_longitude. It identifies areas of highest and lowest concentration for trip starts. The location data is important as it helps us understand where the areas of highest activity are and, as will be seen in one of our later sections, will play an important role in identifying riders as regular customers or subscribers.

In [17]:
%%time

gmaps.configure(api_key="AIzaSyAsBi0MhgoQWfoGMSl5UcD-vR6H76cntxg") # Load private Google API key

locations = CitiBikeDataSampled_5050[['start_station_latitude', 'start_station_longitude']].values.tolist()

m = gmaps.Map()
heatmap_layer = gmaps.Heatmap(data = locations)
m.add_layer(heatmap_layer)
Wall time: 9.89 s

An overall view quickly reveals that station data was only provided for southern Manhattan and, for the most part, northern Brooklyn. This could mean either that the bike share program had not yet expanded into other areas at the time of data collection or that the data simply wasn't included (as mentioned previously, many test sites were in use during this time frame, but CitiBike did not include them in this data set).

Within the range of trip start frequency from the least number of trips (green) to the most trips (red), green and yellow indicate low to medium trip activity in most areas. However, higher pockets of concentration do exist in some places. We will attempt to put this visualization to good use by focusing in on one of these hotspots.

In [18]:
m

A prominent hotspot occurs just east of Bryant Park and the Midtown map label. Zooming into this area (via the regular Google Maps controls, as the rendered visual is interactive) allows for a closer look; a snapshot of the zoomed-in view is embedded below. The hotspot appears slightly elongated and stands out from the other stations. Zooming in further will help explain why, and may shed some light on the higher activity in this area.

Zooming in to this area further helps us see that two stations are very close together. Even so, why might there be such high rider activity at these stations? This higher activity is likely affected by the stations' proximity to the famous Grand Central Station. As commuters and recreationalists alike arrive by train at Grand Central, it is natural that many of them may choose to continue their journey via the two closest bike share stations nearby. When the northernmost bike share station runs out of bikes, riders likely go to the next station to begin their ride instead.

By understanding the dynamics of geographical activity within this data set and the amenities that surround each station, we will be able to more efficiently leverage the data to make our classification and regression predictions.

Geographic Heatmap Comparing Customer vs. Subscriber Start Station Activity

After visualizing the overall dataset locations with a heatmap over NYC, we took the visualization one step further and split the dataset into two segments: Customer vs. Subscriber. Below are two separate gmaps heatmaps containing geographic trip-start densities for each usertype. What we found supported our theories on Customer vs. Subscriber usage tendencies. Seen first, the Customer heatmap contains far fewer dense regions overall, consistent with our suspicion that Customer riders are less "routine" than subscribing riders. When looking for the densest region in this heatmap, one point stood out as particularly interesting: the zoo. The Subscriber gmap does not show the same kind of traffic in that region! This supports the theory that Customer riders use the service more for events, shopping, or one-time convenience. On the Subscriber gmap, the densest region is the one near Grand Central Station discussed earlier, supporting the opposing theory that subscribing members make routine trips to work, groceries, etc., consistently using the bike share service as a means to reach the metro station.

Customer Users

In [19]:
%%time
customerData = CitiBikeDataSampled_5050.query('usertype == "Customer"')
customerLoc = customerData[['start_station_latitude', 'start_station_longitude']].values.tolist()

cmap = gmaps.Map()
customer_layer = gmaps.Heatmap(data=customerLoc)#, fill_color="red", stroke_color="red", scale=3)
cmap.add_layer(customer_layer)
Wall time: 4.76 s
In [20]:
cmap

Subscriber Users

In [21]:
%%time
subscriberData = CitiBikeDataSampled_5050.query('usertype == "Subscriber"')
subscriberLoc = subscriberData[['start_station_latitude', 'start_station_longitude']].values.tolist()

smap = gmaps.Map()
subscriber_layer = gmaps.Heatmap(data=subscriberLoc)#, fill_color="green", stroke_color="green", scale=2)
smap.add_layer(subscriber_layer)
Wall time: 4.94 s
In [22]:
smap

Trip Duration and Linear Distance vs Weather by Customer/Subscriber

Because we were able to bring together historical weather data for the dates in our records, we wanted to explore the relationship these variables have with usertype. If subscribers regularly use the bikes for commuting, as we have begun to see, then weather should not impact their rental statistics as much as it does customers, who appear to be primarily opportunistic in their usage.

A cursory glance reveals a noticeable difference in bike rentals with respect to low temperatures, precipitation, and snowfall. While it is true that there are fewer customers than subscribers, we are concerned primarily with the spread or distribution of the plotted points rather than their quantity. On the customer pair plots, fewer points are distributed across the lower temperature ranges and the higher precipitation/snowfall ranges. The distributions pick back up at higher temperatures and lower precipitation for both usertypes.

If stations consistently see use during "bad" weather, then those stations could be identified as subscriber stations. Further, if certain customers are found making the same trips consistently in all weather types, then they could be pushed for subscription.
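As a rough sketch of that idea (toy data and an assumed precipitation threshold, not figures from our dataset), one could score each station by the share of its trips taken in measurable precipitation:

```python
import pandas as pd

# Toy trips table (hypothetical) mirroring the merged CitiBike/weather columns
trips = pd.DataFrame({
    "start_station_name": ["A", "A", "B", "B", "B"],
    "usertype": ["Subscriber", "Subscriber", "Customer", "Customer", "Subscriber"],
    "PRCP": [0.9, 0.7, 0.0, 0.0, 0.1],
})

# Flag "bad weather" trips (assumed threshold: more than a trace of precipitation)
trips["BadWeather"] = trips["PRCP"] > 0.05

# Share of each station's trips taken in bad weather
bad_share = trips.groupby("start_station_name")["BadWeather"].mean()

# Stations that stay busy in bad weather are candidate "subscriber stations"
candidates = bad_share[bad_share > 0.5].index.tolist()
print(candidates)
```

The same groupby pattern would apply directly to our sampled dataset, swapping in SNOW or TMIN for other notions of "bad" weather.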

Customer Users

In [23]:
%%time
sns.pairplot(CitiBikeDataSampled_5050.query("usertype == 'Customer'"), x_vars=["PRCP","SNOW","TAVE","TMAX","TMIN"], y_vars=["tripduration","tripdurationLog","LinearDistance"])
Wall time: 13.3 s
Out[23]:
<seaborn.axisgrid.PairGrid at 0x1e671a1ad30>

Subscriber Users

In [24]:
%%time
sns.pairplot(CitiBikeDataSampled_5050.query("usertype == 'Subscriber'"), x_vars=["PRCP","SNOW","TAVE","TMAX","TMIN"], y_vars=["tripduration","tripdurationLog","LinearDistance"])
Wall time: 12.2 s
Out[24]:
<seaborn.axisgrid.PairGrid at 0x1e6481dd6a0>
Customer vs. Subscriber Trip Duration by Day of the Week Split Violin Plot

Almost universally, across every day of the week, customers appear to have a higher trip duration than subscribers. While additional analysis will be required to confirm it, one possible explanation is that subscribers can freely take and return their bikes, making them more willing to make shorter trips, versus customers who pay each time they want to rent a bike in the first place. An alternate explanation, based on what we know about the relationship between trip duration and linear distance traveled, is that subscribers use the bikes for commuting to and from specific locations. This would result in lower trip durations than customers who use the bikes for general travel around the city. This possibility is corroborated by the decrease in subscriber activity on weekends.

Identifying the point at which a customer might become a subscriber using this data would probably include monitoring weekday activity and trip duration. If a station has a lot of customers with trip durations similar to those of subscribers, then that station would be a good location to do a focused advertisement of the benefits of subscribing.
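A minimal sketch of that monitoring idea, using a hypothetical toy sample and an assumed closeness threshold of 0.2 on the log-duration scale:

```python
import pandas as pd

# Toy sample (hypothetical) with log trip durations by usertype and station
df = pd.DataFrame({
    "start_station_name": ["A"] * 4 + ["B"] * 4,
    "usertype": ["Customer", "Subscriber"] * 4,
    "tripdurationLog": [6.1, 6.0, 6.2, 6.1,   # station A: customers ride like subscribers
                        7.8, 6.0, 7.9, 6.1],  # station B: customers ride much longer
})

# Median log duration per station, customers and subscribers side by side
med = df.pivot_table(index="start_station_name", columns="usertype",
                     values="tripdurationLog", aggfunc="median")

# Stations where the customer median sits within 0.2 of the subscriber median
# (assumed threshold) would be candidates for targeted subscription advertising
targets = med[(med["Customer"] - med["Subscriber"]).abs() < 0.2].index.tolist()
print(targets)
```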

In [25]:
%%time
sns.set(style="whitegrid", palette="pastel", color_codes=True)
plt.rcParams['figure.figsize'] = (12, 6)
# Draw a nested violinplot and split the violins for easier comparison
sns.violinplot(x="DayOfWeek", y="tripdurationLog", hue="usertype", data=CitiBikeDataSampled_5050, split=True,
               inner="quart", palette={"Subscriber": "g", "Customer": "y"})
sns.despine(left=True)
Wall time: 5.14 s
Customer vs. Subscriber Linear Trip Distance by Day of the Week Split Violin Plot

Unlike trip duration, the linear distances between start and end stations for customers and subscribers appear similar in their means and close in their quartiles. What is noticeable here, though, is that customers are more widely distributed in how far or near they ride, with a significant increase in the number of customers who return their bikes to the station they started from.

Further analysis will be necessary to explore the statistical significance of these differences, but it would be possible to identify the stations frequented by subscribers, treat stations within one standard deviation of the linear distances seen below as "subscriber stations," and then see which stations fall outside those zones in order to build up the messaging encouraging subscription. Furthermore, by identifying these "hot zones" it would be possible to rotate bikes out to increase their longevity.
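One way such "subscriber stations" could be flagged is sketched below with toy per-trip distances (hypothetical values): average the linear distance per station, then keep stations within one standard deviation of the overall station mean.

```python
import pandas as pd

# Toy per-trip linear distances (hypothetical) for subscriber trips
sub = pd.DataFrame({
    "start_station_name": ["A", "A", "B", "B", "C", "C"],
    "LinearDistance": [1.0, 1.2, 1.1, 0.9, 4.8, 5.0],
})

# Mean distance per station, then keep stations within one standard
# deviation of the mean across stations
per_station = sub.groupby("start_station_name")["LinearDistance"].mean()
mu, sigma = per_station.mean(), per_station.std()
subscriber_stations = per_station[(per_station - mu).abs() <= sigma].index.tolist()
print(subscriber_stations)
```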

In [26]:
%%time
sns.set(style="whitegrid", palette="pastel", color_codes=True)
plt.rcParams['figure.figsize'] = (12, 6)
# Draw a nested violinplot and split the violins for easier comparison
sns.violinplot(x="DayOfWeek", y="LinearDistance", hue="usertype", data=CitiBikeDataSampled_5050, split=True,
               inner="quart", palette={"Subscriber": "g", "Customer": "y"})
sns.despine(left=True)
Wall time: 2.85 s

Prepping Data for Analysis

Now that we have the dataset sampled, some legwork is still necessary to convert our categorical attributes into integer values. Below we walk through this process for the following attributes:

  • start_station_name
  • end_station_name
  • gender
  • DayOfWeek
  • TimeOfDay

Once these 5 attributes have been one-hot encoded, we will have added 79 attributes to our dataset for use in our model.
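The 79-column figure follows from the category counts established in the sections below (31 start stations + 33 end stations + 3 gender codes + 7 weekdays + 5 time-of-day buckets). A minimal pd.get_dummies example mirrors the encoding pattern we use:

```python
import pandas as pd

# Each category value becomes its own 0/1 column, named <prefix>_<value>
toy = pd.DataFrame({"gender": ["0", "1", "2", "1"]})
encoded = pd.get_dummies(toy.gender, prefix="gender")
print(list(encoded.columns))  # ['gender_0', 'gender_1', 'gender_2']

# Column budget across the five encoded attributes
total = 31 + 33 + 3 + 7 + 5
print(total)  # 79 new columns
```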

Start Station Name

Initially, including all start (and end) locations resulted in excessive system resource usage later, during randomized principal component computations, freezing our personal workstations and eventually ending with Python 'MemoryError' messages. Therefore, due to the extremely large number of start stations in our dataset (330), we needed to reduce this dimension to a manageable size manually. Through trial and error on the top-frequency stations, we chose to reduce this number to roughly 10% of the original. By identifying the top 20 start stations for Subscribers and Customers separately, we found 9 overlapping stations, producing a final list of 31 stations. When encoding our start_station_name integer columns, we limit the columns to these identified stations.

In [27]:
%%time
    
    #How many Start Stations are there?
print(len(CitiBikeDataSampled_5050["start_station_name"].drop_duplicates()))

    # Top 20 Start Stations for Subscriber Users
startstationsubfreq = pd.DataFrame({'count' : CitiBikeDataSampled_5050[CitiBikeDataSampled_5050["usertype"] == 'Subscriber'].groupby(["start_station_name"]).size()}).reset_index().sort_values('count',ascending = False)
TopSubStartStations = startstationsubfreq.head(20)

del startstationsubfreq

    # Top 20 Start Stations for Customer Users
startstationcustfreq = pd.DataFrame({'count' : CitiBikeDataSampled_5050[CitiBikeDataSampled_5050["usertype"] == 'Customer'].groupby(["start_station_name"]).size()}).reset_index().sort_values('count',ascending = False)
TopCustStartStations = startstationcustfreq.head(20)

del startstationcustfreq

    #Concat Subscribers and Customers
TopStartStations = pd.DataFrame(pd.concat([TopSubStartStations,TopCustStartStations])["start_station_name"].drop_duplicates()).reset_index()    
print(len(TopStartStations))
display(TopStartStations[["start_station_name"]])

del TopStartStations
del TopSubStartStations
del TopCustStartStations

    #Split Start Station Values for 50/50 dataset
AttSplit = pd.get_dummies(CitiBikeDataSampled_5050.start_station_name,prefix='start_station_name')

CitiBikeDataSampled_5050 = pd.concat((CitiBikeDataSampled_5050,AttSplit[["start_station_name_Pershing Square N", "start_station_name_E 17 St & Broadway", "start_station_name_8 Ave & W 31 St", "start_station_name_Lafayette St & E 8 St", "start_station_name_W 21 St & 6 Ave", "start_station_name_8 Ave & W 33 St", "start_station_name_W 20 St & 11 Ave", "start_station_name_Broadway & E 14 St", "start_station_name_Broadway & E 22 St", "start_station_name_W 41 St & 8 Ave", "start_station_name_Cleveland Pl & Spring St", "start_station_name_University Pl & E 14 St", "start_station_name_West St & Chambers St", "start_station_name_E 43 St & Vanderbilt Ave", "start_station_name_Broadway & W 24 St", "start_station_name_Greenwich Ave & 8 Ave", "start_station_name_W 18 St & 6 Ave", "start_station_name_Broadway & W 60 St", "start_station_name_Pershing Square S", "start_station_name_W 33 St & 7 Ave", "start_station_name_Central Park S & 6 Ave", "start_station_name_Centre St & Chambers St", "start_station_name_Grand Army Plaza & Central Park S", "start_station_name_Vesey Pl & River Terrace", "start_station_name_Broadway & W 58 St", "start_station_name_West Thames St", "start_station_name_12 Ave & W 40 St", "start_station_name_9 Ave & W 14 St", "start_station_name_W 14 St & The High Line", "start_station_name_State St", "start_station_name_Broadway & Battery Pl"]]),axis=1) # add back into the dataframe

del AttSplit
330
31
start_station_name
0 Pershing Square N
1 E 17 St & Broadway
2 8 Ave & W 31 St
3 Lafayette St & E 8 St
4 W 21 St & 6 Ave
5 8 Ave & W 33 St
6 W 20 St & 11 Ave
7 Broadway & E 14 St
8 Broadway & E 22 St
9 W 41 St & 8 Ave
10 Cleveland Pl & Spring St
11 University Pl & E 14 St
12 West St & Chambers St
13 E 43 St & Vanderbilt Ave
14 Broadway & W 24 St
15 Greenwich Ave & 8 Ave
16 W 18 St & 6 Ave
17 Broadway & W 60 St
18 Pershing Square S
19 W 33 St & 7 Ave
20 Central Park S & 6 Ave
21 Centre St & Chambers St
22 Grand Army Plaza & Central Park S
23 Vesey Pl & River Terrace
24 Broadway & W 58 St
25 West Thames St
26 12 Ave & W 40 St
27 9 Ave & W 14 St
28 W 14 St & The High Line
29 State St
30 Broadway & Battery Pl
Wall time: 2.17 s

End Station Name

Similarly, our dataset has an extremely large number of end stations (330), and including all of them caused system crashes during principal component analysis later in the lab. We were required to reduce this dimension to a manageable size. Through trial and error on the top-frequency stations, we chose to reduce this number to roughly 10% of the original. By identifying the top 20 end stations for Subscribers and Customers separately, we found 7 overlapping stations, producing a final list of 33 stations. When encoding our end_station_name integer columns, we limit the columns to these identified stations.

In [28]:
%%time
    
    #How many End Stations are there?
print(len(CitiBikeDataSampled_5050["end_station_name"].drop_duplicates()))

    # Top 20 End Stations for Subscriber Users
endstationsubfreq = pd.DataFrame({'count' : CitiBikeDataSampled_5050[CitiBikeDataSampled_5050["usertype"] == 'Subscriber'].groupby(["end_station_name"]).size()}).reset_index().sort_values('count',ascending = False)
TopSubendStations = endstationsubfreq.head(20)

del endstationsubfreq

    # Top 20 End Stations for Customer Users
endstationcustfreq = pd.DataFrame({'count' : CitiBikeDataSampled_5050[CitiBikeDataSampled_5050["usertype"] == 'Customer'].groupby(["end_station_name"]).size()}).reset_index().sort_values('count',ascending = False)
TopCustendStations = endstationcustfreq.head(20)

del endstationcustfreq

    #Concat Subscribers and Customers
TopendStations = pd.DataFrame(pd.concat([TopSubendStations,TopCustendStations])["end_station_name"].drop_duplicates()).reset_index()    
print(len(TopendStations))
display(TopendStations[["end_station_name"]])

del TopendStations
del TopSubendStations
del TopCustendStations

    #Split End Station Values for 50/50 dataset
AttSplit = pd.get_dummies(CitiBikeDataSampled_5050.end_station_name,prefix='end_station_name')

CitiBikeDataSampled_5050 = pd.concat((CitiBikeDataSampled_5050,AttSplit[["end_station_name_E 17 St & Broadway", "end_station_name_Lafayette St & E 8 St", "end_station_name_8 Ave & W 31 St", "end_station_name_W 21 St & 6 Ave", "end_station_name_Pershing Square N", "end_station_name_W 20 St & 11 Ave", "end_station_name_Broadway & E 14 St", "end_station_name_Broadway & E 22 St", "end_station_name_University Pl & E 14 St", "end_station_name_W 41 St & 8 Ave", "end_station_name_West St & Chambers St", "end_station_name_Cleveland Pl & Spring St", "end_station_name_Greenwich Ave & 8 Ave", "end_station_name_E 43 St & Vanderbilt Ave", "end_station_name_Broadway & W 24 St", "end_station_name_W 18 St & 6 Ave", "end_station_name_MacDougal St & Prince St", "end_station_name_Carmine St & 6 Ave", "end_station_name_8 Ave & W 33 St", "end_station_name_2 Ave & E 31 St", "end_station_name_Central Park S & 6 Ave", "end_station_name_Centre St & Chambers St", "end_station_name_Grand Army Plaza & Central Park S", "end_station_name_Broadway & W 60 St", "end_station_name_Broadway & W 58 St", "end_station_name_12 Ave & W 40 St", "end_station_name_Vesey Pl & River Terrace", "end_station_name_W 14 St & The High Line", "end_station_name_9 Ave & W 14 St", "end_station_name_West Thames St", "end_station_name_State St", "end_station_name_Old Fulton St", "end_station_name_South End Ave & Liberty St"]]),axis=1) # add back into the dataframe

del AttSplit
330
33
end_station_name
0 E 17 St & Broadway
1 Lafayette St & E 8 St
2 8 Ave & W 31 St
3 W 21 St & 6 Ave
4 Pershing Square N
5 W 20 St & 11 Ave
6 Broadway & E 14 St
7 Broadway & E 22 St
8 University Pl & E 14 St
9 W 41 St & 8 Ave
10 West St & Chambers St
11 Cleveland Pl & Spring St
12 Greenwich Ave & 8 Ave
13 E 43 St & Vanderbilt Ave
14 Broadway & W 24 St
15 W 18 St & 6 Ave
16 MacDougal St & Prince St
17 Carmine St & 6 Ave
18 8 Ave & W 33 St
19 2 Ave & E 31 St
20 Central Park S & 6 Ave
21 Centre St & Chambers St
22 Grand Army Plaza & Central Park S
23 Broadway & W 60 St
24 Broadway & W 58 St
25 12 Ave & W 40 St
26 Vesey Pl & River Terrace
27 W 14 St & The High Line
28 9 Ave & W 14 St
29 West Thames St
30 State St
31 Old Fulton St
32 South End Ave & Liberty St
Wall time: 1.99 s

Gender, DayOfWeek, and TimeOfDay

The remaining attributes to encode {Gender, DayOfWeek, and TimeOfDay} take the following possible values. Each value will likewise be encoded as an individual integer column.

  • Gender: {0 = unknown, 1 = male, 2 = female}
  • DayOfWeek: {Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday}
  • TimeOfDay: {Morning, Midday, Afternoon, Evening, Night}
In [29]:
%%time

    #Split gender Values for 50/50 dataset
AttSplit = pd.get_dummies(CitiBikeDataSampled_5050.gender,prefix='gender')
CitiBikeDataSampled_5050 = pd.concat((CitiBikeDataSampled_5050,AttSplit),axis=1) # add back into the dataframe

del AttSplit

    #Split DayOfWeek Values for 50/50 dataset
AttSplit = pd.get_dummies(CitiBikeDataSampled_5050.DayOfWeek,prefix='DayOfWeek')
CitiBikeDataSampled_5050 = pd.concat((CitiBikeDataSampled_5050,AttSplit),axis=1) # add back into the dataframe

del AttSplit

    #Split TimeOfDay Values for 50/50 dataset
AttSplit = pd.get_dummies(CitiBikeDataSampled_5050.TimeOfDay,prefix='TimeOfDay')
CitiBikeDataSampled_5050 = pd.concat((CitiBikeDataSampled_5050,AttSplit),axis=1) # add back into the dataframe

del AttSplit

display(CitiBikeDataSampled_5050.head())
[display() output: first five rows of CitiBikeDataSampled_5050, too wide to render legibly here — the original attributes (tripduration, starttime, stoptime, station ids/names/coordinates, bikeid, usertype, birth year, gender, LinearDistance, DayOfWeek, TimeOfDay, HolidayFlag, PRCP, SNOW, TAVE, TMAX, TMIN, Age, tripdurationLog) followed by the 79 new 0/1 dummy columns for start_station_name, end_station_name, gender, DayOfWeek, and TimeOfDay]
Wall time: 3.27 s

With these encodings complete, our final data set appears ready for train/test cross-validation. However, given the large number of attributes now present, it would be wise to investigate dimensionality reduction, not only to speed up model generation but also to improve accuracy by removing redundant, correlated variables.

Data Set Summary

At this stage, we have converted our original 30 variables into 105 attributes after creating dummy variables for categorical data such as day of the week, time of day, station name, and gender. These 105 attributes and their data types are as follows:
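The dummy-variable expansion described above can be sketched with `pandas.get_dummies`; the toy frame below merely stands in for the real data set, but the resulting column-naming pattern (e.g. `DayOfWeek_Sunday`) matches what appears in the summary table.

```python
import pandas as pd

# Toy frame standing in for the full CitiBike data set
toy = pd.DataFrame({
    'tripduration': [2023, 383, 634],
    'DayOfWeek': ['Sunday', 'Wednesday', 'Sunday'],
    'TimeOfDay': ['Afternoon', 'Evening', 'Evening'],
})

# get_dummies adds one 0/1 indicator column per category level,
# e.g. DayOfWeek_Sunday and TimeOfDay_Evening, and drops the originals
encoded = pd.get_dummies(toy, columns=['DayOfWeek', 'TimeOfDay'])
print(sorted(encoded.columns))
```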

In [30]:
%%time

# Summarize each attribute's name and data type in a two-column frame
summary_df = pd.DataFrame({'Attribute Name': CitiBikeDataSampled_5050.columns,
                           'Data Type': CitiBikeDataSampled_5050.dtypes.values})
display(summary_df)

del summary_df
Attribute Name Data Type
0 tripduration int64
1 starttime datetime64[ns]
2 stoptime datetime64[ns]
3 start_station_id object
4 start_station_name object
5 start_station_latitude float64
6 start_station_longitude float64
7 end_station_id object
8 end_station_name object
9 end_station_latitude float64
10 end_station_longitude float64
11 bikeid object
12 usertype object
13 birth year int64
14 gender object
15 LinearDistance float64
16 DayOfWeek object
17 TimeOfDay object
18 HolidayFlag object
19 PRCP float64
20 SNOW float64
21 TAVE int64
22 TMAX int64
23 TMIN int64
24 Age int64
25 tripdurationLog float64
26 start_station_name_Pershing Square N uint8
27 start_station_name_E 17 St & Broadway uint8
28 start_station_name_8 Ave & W 31 St uint8
29 start_station_name_Lafayette St & E 8 St uint8
30 start_station_name_W 21 St & 6 Ave uint8
31 start_station_name_8 Ave & W 33 St uint8
32 start_station_name_W 20 St & 11 Ave uint8
33 start_station_name_Broadway & E 14 St uint8
34 start_station_name_Broadway & E 22 St uint8
35 start_station_name_W 41 St & 8 Ave uint8
36 start_station_name_Cleveland Pl & Spring St uint8
37 start_station_name_University Pl & E 14 St uint8
38 start_station_name_West St & Chambers St uint8
39 start_station_name_E 43 St & Vanderbilt Ave uint8
40 start_station_name_Broadway & W 24 St uint8
41 start_station_name_Greenwich Ave & 8 Ave uint8
42 start_station_name_W 18 St & 6 Ave uint8
43 start_station_name_Broadway & W 60 St uint8
44 start_station_name_Pershing Square S uint8
45 start_station_name_W 33 St & 7 Ave uint8
46 start_station_name_Central Park S & 6 Ave uint8
47 start_station_name_Centre St & Chambers St uint8
48 start_station_name_Grand Army Plaza & Central ... uint8
49 start_station_name_Vesey Pl & River Terrace uint8
50 start_station_name_Broadway & W 58 St uint8
51 start_station_name_West Thames St uint8
52 start_station_name_12 Ave & W 40 St uint8
53 start_station_name_9 Ave & W 14 St uint8
54 start_station_name_W 14 St & The High Line uint8
55 start_station_name_State St uint8
56 start_station_name_Broadway & Battery Pl uint8
57 end_station_name_E 17 St & Broadway uint8
58 end_station_name_Lafayette St & E 8 St uint8
59 end_station_name_8 Ave & W 31 St uint8
60 end_station_name_W 21 St & 6 Ave uint8
61 end_station_name_Pershing Square N uint8
62 end_station_name_W 20 St & 11 Ave uint8
63 end_station_name_Broadway & E 14 St uint8
64 end_station_name_Broadway & E 22 St uint8
65 end_station_name_University Pl & E 14 St uint8
66 end_station_name_W 41 St & 8 Ave uint8
67 end_station_name_West St & Chambers St uint8
68 end_station_name_Cleveland Pl & Spring St uint8
69 end_station_name_Greenwich Ave & 8 Ave uint8
70 end_station_name_E 43 St & Vanderbilt Ave uint8
71 end_station_name_Broadway & W 24 St uint8
72 end_station_name_W 18 St & 6 Ave uint8
73 end_station_name_MacDougal St & Prince St uint8
74 end_station_name_Carmine St & 6 Ave uint8
75 end_station_name_8 Ave & W 33 St uint8
76 end_station_name_2 Ave & E 31 St uint8
77 end_station_name_Central Park S & 6 Ave uint8
78 end_station_name_Centre St & Chambers St uint8
79 end_station_name_Grand Army Plaza & Central Pa... uint8
80 end_station_name_Broadway & W 60 St uint8
81 end_station_name_Broadway & W 58 St uint8
82 end_station_name_12 Ave & W 40 St uint8
83 end_station_name_Vesey Pl & River Terrace uint8
84 end_station_name_W 14 St & The High Line uint8
85 end_station_name_9 Ave & W 14 St uint8
86 end_station_name_West Thames St uint8
87 end_station_name_State St uint8
88 end_station_name_Old Fulton St uint8
89 end_station_name_South End Ave & Liberty St uint8
90 gender_0 uint8
91 gender_1 uint8
92 gender_2 uint8
93 DayOfWeek_Friday uint8
94 DayOfWeek_Monday uint8
95 DayOfWeek_Saturday uint8
96 DayOfWeek_Sunday uint8
97 DayOfWeek_Thursday uint8
98 DayOfWeek_Tuesday uint8
99 DayOfWeek_Wednesday uint8
100 TimeOfDay_Afternoon uint8
101 TimeOfDay_Evening uint8
102 TimeOfDay_Midday uint8
103 TimeOfDay_Morning uint8
104 TimeOfDay_Night uint8
Wall time: 276 ms

Defining Cluster Attributes

Our stratified, processed data comprises 105 attributes, ranging from weather and distance measures to location and user-type data, across all 500,000 sampled observations, and some variables (weather attributes, and even certain start and end stations) correlate with one another. For these reasons, during our Lab 2 analysis we chose to reduce the number of attributes considered for Customer/Subscriber prediction, using Principal Component Analysis (PCA) to lower the dimensionality of our data set.
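A minimal sketch of such a PCA reduction, on synthetic data standing in for our scaled attribute matrix (not the actual Lab 2 code):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.randn(1000, 10)                       # stand-in for a numeric attribute matrix
X[:, 1] = X[:, 0] + 0.01 * rng.randn(1000)    # inject a redundant, highly correlated column

# Keep just enough components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(StandardScaler().fit_transform(X))
print(X_reduced.shape[1], 'components retained out of', X.shape[1], 'attributes')
```

Because the correlated column adds almost no independent variance, PCA reaches the 95% threshold with fewer components than original attributes, which is exactly the redundancy-removal benefit described above.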

Furthermore, while analyzing our principal components' loadings, we identified only 22 of the originally processed 105 attributes as contextually important for distinguishing Customers from Subscribers. Because the intent of our Lab 3 analysis is to identify Customer users whose behavior suggests they should be Subscribers, we likewise restrict clustering to these 22 attributes.

These attributes are as follows:

  • start_station_latitude
  • start_station_longitude
  • end_station_latitude
  • end_station_longitude
  • PRCP
  • SNOW
  • TAVE
  • TMAX
  • TMIN
  • DayOfWeek_Friday
  • DayOfWeek_Monday
  • DayOfWeek_Saturday
  • DayOfWeek_Sunday
  • DayOfWeek_Thursday
  • DayOfWeek_Tuesday
  • DayOfWeek_Wednesday
  • TimeOfDay_Afternoon
  • TimeOfDay_Evening
  • TimeOfDay_Midday
  • TimeOfDay_Morning
  • TimeOfDay_Night
  • tripdurationLog
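As a hedged illustration of how loading-based selection works, the sketch below ranks toy attributes by their largest absolute loading on the retained components (illustrative column names only; the actual Lab 2 selection also weighed context, not loadings alone):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.RandomState(1)
cols = ['PRCP', 'SNOW', 'TAVE', 'TMAX', 'TMIN', 'tripdurationLog']
X = pd.DataFrame(rng.randn(500, len(cols)), columns=cols)

pca = PCA(n_components=2).fit(X)
# Loadings matrix: one row per retained component, one column per attribute
loadings = pd.DataFrame(pca.components_, columns=cols)

# Rank attributes by their largest absolute loading on any retained component
top = loadings.abs().max(axis=0).sort_values(ascending=False)
print(top.index.tolist())
```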

In addition to using only these attributes, we chose to split our stratified sample of 500,000 transactions into separate Customer and Subscriber data sets before clustering. Clustering the Customer data separately from the Subscriber data offers further granularity into the user sub-groups (based on transaction details) that comprise each user class. When we later compare these clusterings, we will be able to classify one user type's data against the opposite user type's cluster IDs. This approach and its implementation are described in much greater detail in the Deployment section; for now, it suffices to say that clustering each known user type separately is a necessary means of identifying would-be Subscribers.

In [31]:
%%time

# Subset data set to only variables identified to have the greatest PCA loadings
attr_clus = CitiBikeDataSampled_5050[['start_station_latitude',
                                      'start_station_longitude',
                                      'end_station_latitude',
                                      'end_station_longitude',
                                      'PRCP',
                                      'SNOW',
                                      'TAVE',
                                      'TMAX',
                                      'TMIN',
                                      'DayOfWeek_Friday',
                                      'DayOfWeek_Monday',
                                      'DayOfWeek_Saturday',
                                      'DayOfWeek_Sunday',
                                      'DayOfWeek_Thursday',
                                      'DayOfWeek_Tuesday',
                                      'DayOfWeek_Wednesday',
                                      'TimeOfDay_Afternoon',
                                      'TimeOfDay_Evening',
                                      'TimeOfDay_Midday',
                                      'TimeOfDay_Morning',
                                      'TimeOfDay_Night',
                                      'tripdurationLog',
                                      'usertype']]

attr_scaled = attr_clus.iloc[:, :-1]  # exclude usertype from the columns to be scaled
scaler = StandardScaler().fit(attr_scaled)
CitiBike_clus = scaler.transform(attr_scaled)

CitiBike_clus = pd.DataFrame(CitiBike_clus)
users = pd.DataFrame(CitiBikeDataSampled_5050.usertype)
CitiBike_clus = pd.concat([CitiBike_clus.reset_index(), users.reset_index()], axis = 1)
del CitiBike_clus['index']
CitiBike_clus.columns = attr_clus.columns

CitiBike_C = CitiBike_clus.loc[CitiBike_clus['usertype'] == 'Customer']
CitiBike_S = CitiBike_clus.loc[CitiBike_clus['usertype'] == 'Subscriber']

# A MinMaxScaler-based alternative was also explored here but ultimately not used
print('Customer data dimensions =', CitiBike_C.shape)
print('Subscriber data dimensions =',CitiBike_S.shape)
Customer data dimensions = (250000, 23)
Subscriber data dimensions = (250000, 23)
Wall time: 1.35 s

In addition to separating the data by user type, we also scaled the data values to remove bias during clustering caused by differing attribute scales. Without scaling, attributes such as station coordinates and trip duration would carry heavier weights than the one-hot-encoded attributes and precipitation data, yielding unbalanced and improperly clustered groups. The first five standardized ride transactions for Customers and Subscribers are shown below as an example of the data after scaling.
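As a quick sanity check on what standardization does, the sketch below (toy numbers, not our actual scaler object) confirms that each z-scored column ends up with mean ≈ 0 and standard deviation ≈ 1, regardless of its original scale:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy columns on very different scales: a latitude-like value and a 0/1 dummy
X = np.array([[40.69, 1.0], [40.74, 0.0], [40.71, 0.0], [40.76, 1.0]])
Z = StandardScaler().fit_transform(X)
print(Z.mean(axis=0).round(6), Z.std(axis=0).round(6))
```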

In [32]:
display(CitiBike_C.head())
display(CitiBike_S.head())
start_station_latitude start_station_longitude end_station_latitude end_station_longitude PRCP SNOW TAVE TMAX TMIN DayOfWeek_Friday DayOfWeek_Monday DayOfWeek_Saturday DayOfWeek_Sunday DayOfWeek_Thursday DayOfWeek_Tuesday DayOfWeek_Wednesday TimeOfDay_Afternoon TimeOfDay_Evening TimeOfDay_Midday TimeOfDay_Morning TimeOfDay_Night tripdurationLog usertype
0 -1.945869 1.118504 -0.866803 2.396808 -0.305915 -0.069606 1.285221 1.258632 1.319126 -0.399214 -0.385424 -0.472244 2.267759 -0.381894 -0.385824 -0.388611 1.845237 -0.694351 -0.544202 -0.407459 -0.288457 1.187103 Customer
1 0.427429 -0.453045 0.585415 -0.808549 -0.305915 -0.069606 -0.464291 -0.631718 -0.316163 -0.399214 -0.385424 -0.472244 -0.440964 -0.381894 -0.385824 2.573270 -0.541936 1.440195 -0.544202 -0.407459 -0.288457 -0.966816 Customer
2 -1.217379 -1.997279 -1.051587 -1.867364 -0.305915 -0.069606 0.961237 0.754539 1.122892 -0.399214 -0.385424 -0.472244 2.267759 -0.381894 -0.385824 -0.388611 -0.541936 1.440195 -0.544202 -0.407459 -0.288457 -0.314528 Customer
3 1.443002 1.104047 1.270051 1.338197 4.087822 -0.069606 0.637254 0.313457 0.926657 -0.399214 2.594548 -0.472244 -0.440964 -0.381894 -0.385824 -0.388611 -0.541936 1.440195 -0.544202 -0.407459 -0.288457 -1.159576 Customer
4 0.383217 0.639000 0.391566 0.632595 0.007923 -0.069606 0.507660 0.376469 0.665011 -0.399214 -0.385424 2.117550 -0.440964 -0.381894 -0.385824 -0.388611 -0.541936 -0.694351 1.837552 -0.407459 -0.288457 -1.901593 Customer
start_station_latitude start_station_longitude end_station_latitude end_station_longitude PRCP SNOW TAVE TMAX TMIN DayOfWeek_Friday DayOfWeek_Monday DayOfWeek_Saturday DayOfWeek_Sunday DayOfWeek_Thursday DayOfWeek_Tuesday DayOfWeek_Wednesday TimeOfDay_Afternoon TimeOfDay_Evening TimeOfDay_Midday TimeOfDay_Morning TimeOfDay_Night tripdurationLog usertype
250000 0.086126 -0.510692 0.273268 -0.859722 -0.148996 -0.069606 -0.010714 0.061411 -0.119928 -0.399214 2.594548 -0.472244 -0.440964 -0.381894 -0.385824 -0.388611 -0.541936 -0.694351 -0.544202 -0.407459 3.466716 -1.508571 Subscriber
250001 0.886272 1.066991 0.273268 -0.859722 -0.305915 -0.069606 -0.334697 -0.442683 -0.250751 -0.399214 -0.385424 -0.472244 -0.440964 2.618529 -0.385824 -0.388611 -0.541936 1.440195 -0.544202 -0.407459 -0.288457 0.120055 Subscriber
250002 -0.260767 0.094673 0.100944 -0.237693 -0.305915 -0.069606 -2.472990 -2.648092 -2.278510 -0.399214 -0.385424 -0.472244 -0.440964 2.618529 -0.385824 -0.388611 -0.541936 -0.694351 -0.544202 2.454233 -0.288457 -1.349308 Subscriber
250003 1.140597 1.503969 -0.147250 0.052253 -0.305915 -0.069606 0.442864 0.250446 0.665011 -0.399214 -0.385424 -0.472244 2.267759 -0.381894 -0.385824 -0.388611 -0.541936 -0.694351 1.837552 -0.407459 -0.288457 -1.638098 Subscriber
250004 0.143032 1.036446 0.333087 0.564781 -0.305915 -0.069606 -0.205104 -0.001601 -0.446986 -0.399214 -0.385424 -0.472244 -0.440964 -0.381894 -0.385824 2.573270 -0.541936 -0.694351 -0.544202 -0.407459 3.466716 -1.349308 Subscriber

Modeling and Evaluation Part 1 - Train and adjust parameters

K-Means Clustering

With our attributes identified and our data sets segregated into Customers and Subscribers, we began with our first clustering algorithm: K-Means. With 22 attributes and 250,000 observations in each data set, K-Means is potentially a good fit. Our first step is identifying the number of clusters to use. Because our data set is not entirely quantitative, we based our decision on silhouette scores rather than pure accuracy scores. Unfortunately, silhouette scoring is resource-intensive, so we were forced to sample each data set down to a size at which we could consistently run the silhouette graphing process. We sampled 25,000 observations (about 10%) with replacement.

With a sample size decided upon, we ran each data set through multiple iterations of silhouette analysis over an incremental series of cluster counts from 2 to 30 (up to 70 for the Subscriber data set). Each iteration used a different seed, giving us a range of random samples and a general sense of which cluster counts would yield the best results. Below, the cluster counts with the best silhouette scores are shown alongside their silhouette graphs.
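Stripped of its plotting, the selection loop amounts to the following reduced sketch (synthetic blobs stand in for our 25,000-row samples; the full plotting version follows below):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for one sampled user data set
X, _ = make_blobs(n_samples=2000, centers=4, random_state=0)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=10, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean silhouette over all samples

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```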

Subscriber K-Means Silhouette Scores

In [33]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np

kmeans_S = CitiBike_S.sample(n=25000, replace=True, random_state=10101)
kmeans_y = kmeans_S['usertype']
kmeans_X = kmeans_S[["start_station_latitude","start_station_longitude","end_station_latitude","end_station_longitude","PRCP","SNOW","TAVE","TMAX","TMIN","DayOfWeek_Friday","DayOfWeek_Monday","DayOfWeek_Saturday","DayOfWeek_Sunday","DayOfWeek_Thursday","DayOfWeek_Tuesday","DayOfWeek_Wednesday","TimeOfDay_Afternoon","TimeOfDay_Evening","TimeOfDay_Midday","TimeOfDay_Morning","TimeOfDay_Night","tripdurationLog"]]


#For testing and finding number of clusters
#range_n_clusters = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
#range_n_clusters = [51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70]
#Did see equivalent averages at around n clusters 25 to 30, but that was due to extreme spikes due to small cluster size
range_n_clusters = [9, 10, 11, 12, 13]

for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(kmeans_X) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(kmeans_X)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(kmeans_X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(kmeans_X, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)  # 'spectral' was renamed in newer matplotlib
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(kmeans_X["start_station_longitude"], kmeans_X["start_station_latitude"], marker='.', s=30, lw=0, alpha=0.7,
                c=colors)

    # Labeling the clusters; column 0 is latitude and column 1 is longitude,
    # so indices are swapped to match the longitude-x/latitude-y axes above
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 1], centers[:, 0],
                marker='o', c="white", alpha=1, s=200)

    for i, c in enumerate(centers):
        ax2.scatter(c[1], c[0], marker='$%d$' % i, alpha=1, s=50)

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Scaled start station longitude")
    ax2.set_ylabel("Scaled start station latitude")

    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')

    plt.show()

For n_clusters = 9 The average silhouette_score is : 0.199781516478
For n_clusters = 10 The average silhouette_score is : 0.196131086326
For n_clusters = 11 The average silhouette_score is : 0.203819623072
For n_clusters = 12 The average silhouette_score is : 0.201432376304
For n_clusters = 13 The average silhouette_score is : 0.187475393763

The Subscriber data set consistently had at least one cluster with a negative silhouette score range. Further analysis would be needed to identify why that collection of observations scores so low, so we instead selected a cluster count with a high average silhouette score and as small a negative range as possible. Above are the top five cluster counts by this criterion. Ultimately, we decided on 11 clusters.

It should be noted that we also explored cluster counts above 50. While we eventually saw average scores above 0.24, this was driven mainly by narrow clusters with high score peaks. Above roughly 15 clusters, the average score might improve, but multiple clusters exhibited negative ranges with extreme peaks.

Customer K-Means Silhouette Scores

In [34]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np

kmeans_C = CitiBike_C.sample(n=25000, replace=True, random_state=10101)
kmeans_y = kmeans_C['usertype']
kmeans_X = kmeans_C[["start_station_latitude","start_station_longitude","end_station_latitude","end_station_longitude","PRCP","SNOW","TAVE","TMAX","TMIN","DayOfWeek_Friday","DayOfWeek_Monday","DayOfWeek_Saturday","DayOfWeek_Sunday","DayOfWeek_Thursday","DayOfWeek_Tuesday","DayOfWeek_Wednesday","TimeOfDay_Afternoon","TimeOfDay_Evening","TimeOfDay_Midday","TimeOfDay_Morning","TimeOfDay_Night","tripdurationLog"]]


#For testing and finding number of clusters
#range_n_clusters = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
#Did see equivalent averages at around n clusters 25 to 30, but that was due to extreme spikes due to small cluster size
range_n_clusters = [6, 7, 8, 9, 10]

for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(kmeans_X) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(kmeans_X)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(kmeans_X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(kmeans_X, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)  # 'spectral' was renamed in newer matplotlib
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(kmeans_X["start_station_longitude"], kmeans_X["start_station_latitude"], marker='.', s=30, lw=0, alpha=0.7,
                c=colors)

    # Labeling the clusters; column 0 is latitude and column 1 is longitude,
    # so indices are swapped to match the longitude-x/latitude-y axes above
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 1], centers[:, 0],
                marker='o', c="white", alpha=1, s=200)

    for i, c in enumerate(centers):
        ax2.scatter(c[1], c[0], marker='$%d$' % i, alpha=1, s=50)

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Scaled start station longitude")
    ax2.set_ylabel("Scaled start station latitude")

    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')

    plt.show()

For n_clusters = 6 The average silhouette_score is : 0.196749844481
For n_clusters = 7 The average silhouette_score is : 0.221545306167
For n_clusters = 8 The average silhouette_score is : 0.177520274085
For n_clusters = 9 The average silhouette_score is : 0.183981689115
For n_clusters = 10 The average silhouette_score is : 0.179497780886

As with the Subscriber data set, we examined a range of cluster counts for Customers. Perhaps surprisingly, despite the same features, fewer clusters were needed to represent this data. Here too, higher cluster counts yielded higher silhouette scores, but at the cost of less uniform cluster sizes, higher peaks, and more negative ranges. Ultimately, we settled on 7 clusters for the Customer data set.

DBSCAN

Our next approach to clustering the Citi Bike user data is DBSCAN, applied to the 22 attributes selected for clustering. However, initial clustering attempts that included longitude and latitude data failed with memory errors. For this reason, we chose to remove start and end station coordinate data from the attributes identified for clustering.

While this differs from our other clustering approaches in this lab, dropping the coordinates is a necessary concession to the processing problems encountered on our workstations. Our theory is that the failures stem in part from having so many transactions with exactly the same geo-coordinate values. One alternative would be to one-hot encode the coordinate values, treating them as categorical data; the downside is that this would significantly increase the number of attributes in our data set. Given the problem and the scope of this analysis, we opted for removal instead.
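For reference, the rejected alternative would look something like the following sketch, which treats the station identity (and hence its fixed coordinate pair) as a categorical value. Toy data only; with roughly 330 real stations this would add hundreds of columns, which is why we removed the coordinates instead.

```python
import pandas as pd

# Toy trips; each station name implies one fixed coordinate pair
trips = pd.DataFrame({
    'start_station_name': ['West Thames St', 'State St', 'West Thames St'],
    'tripdurationLog': [6.45, 5.94, 7.61],
})

# One indicator column per station replaces the two numeric coordinate columns
encoded = pd.get_dummies(trips, columns=['start_station_name'])
print(encoded.shape)
```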

In [35]:
C_dbscan = CitiBike_C.iloc[:, 4:22]  # drop the four coordinate columns and usertype
S_dbscan = CitiBike_S.iloc[:, 4:22]
CitiBike_loc = CitiBike_clus[['start_station_latitude','start_station_longitude']].drop_duplicates()
print('Customer dbscan data dimensions =', C_dbscan.shape)
print('Subscriber dbscan data dimensions =', S_dbscan.shape)
Customer dbscan data dimensions = (250000, 18)
Subscriber dbscan data dimensions = (250000, 18)

Coordinate removal aside, we have some reservations about DBSCAN clustering in general, since its density estimates rest on Euclidean distance measures. Even with 18 attributes instead of 22, the high dimensionality of our Customer and Subscriber data sets is expected to blunt the effectiveness of distance as the primary measure. Nevertheless, the attempt is worthwhile as a comparison against our other clustering methods.
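This concern can be illustrated numerically: as dimensionality grows, pairwise Euclidean distances concentrate, leaving a density threshold such as DBSCAN's eps little contrast to work with (a synthetic demonstration, not our actual data):

```python
import numpy as np

rng = np.random.RandomState(0)
contrast = {}
for d in (2, 18):
    X = rng.randn(500, d)
    diff = X[:, None, :] - X[None, :, :]                 # all pairwise differences
    dist = np.sqrt((diff ** 2).sum(-1))[np.triu_indices(500, k=1)]
    contrast[d] = (dist.max() - dist.min()) / dist.min()  # relative distance spread
    print(d, 'dims: relative distance spread =', round(contrast[d], 1))
```

The relative spread between the nearest and farthest pairs shrinks dramatically between 2 and 18 dimensions, which is the effect we expect to hamper DBSCAN here.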

Of course, another expected limitation is size. DBSCAN is not very efficient with our large data set, since a full pairwise similarity matrix must be constructed, amounting to $n^2$ computations each time the algorithm is run (see scikit-learn's description). DBSCAN is therefore expected to be slow, making parameter adjustments difficult and time-consuming.
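A back-of-the-envelope calculation shows the scale of the problem for a dense pairwise matrix over either of our 250,000-row data sets:

```python
# Dense pairwise float64 distance matrix over one 250,000-row user data set
n = 250_000
pairwise_bytes = n * n * 8          # n^2 entries, 8 bytes per float64
print(pairwise_bytes / 1e9, 'GB')   # 500.0 GB, far beyond workstation memory
```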

Moving on with the analysis, we will later plot the cluster IDs against the station coordinates to gain better insight into which cluster IDs appear at each station. (Again, the coordinate attributes are removed from the DBSCAN cluster data; plotting against coordinates serves only the marketing application described in the Deployment section.) The initial coordinates, without clustering, are depicted in the following scatterplot.

In [36]:
%%time
plt.figure(figsize=(15,5))
plt.plot(CitiBike_loc.start_station_longitude, CitiBike_loc.start_station_latitude, 'bo', markersize=4, markerfacecolor='b') # plot the station locations
plt.title('Latitude/Longitude Data')
plt.xlabel('Scaled Longitude Coordinate')
plt.ylabel('Scaled Latitude Coordinate')
plt.xlim(-10,10)
plt.ylim(-3,2.5)
plt.ticklabel_format(style='plain', axis='x')
plt.grid()
plt.ticklabel_format(useOffset=False)

plt.show()
Wall time: 209 ms

As mentioned previously, customer and subscriber data will be clustered separately in order to help identify customers that should be subscribers but aren't, as will be discussed in greater detail in the Deployment section.

Below are functions utilized throughout the DBSCAN clustering process. The getGraph function generates a k-neighbors graph of the data in order to plot potential eps values for a prescribed minimum number of samples. The epsPlot function plots these potential eps values, and the dbs function clusters the data for given eps and minimum-samples parameters. Because we plan to plot the clusters against station locations, we adopted and modified a couple of functions from a Stack Overflow post to jitter the data points in the scatterplot; these functions are provided below as well.

In [37]:
%%time

def rand_jitter(arr):
    stdev = arr.max()/100.
    return arr + np.random.randn(len(arr)) * stdev

def jitter(x, y, s=20, c='b', marker='o', cmap=None, norm=None, vmin=None, vmax=None, alpha=None, linewidths=None, **kwargs):
    # Pass the jittered coordinates and all styling arguments through to plt.scatter
    return plt.scatter(rand_jitter(x), rand_jitter(y), s=s, c=c, marker=marker, cmap=cmap, norm=norm, vmin=vmin, vmax=vmax, alpha=alpha, linewidths=linewidths, **kwargs)

def getGraph(N, data):
    graph = kneighbors_graph(data, n_neighbors = N, mode='distance') # calculate distance to nearest neighbors
    #CitiBike_knn_graph = kneighbors_graph(CitiBike_clus, n_neighbors = N, mode='distance') # calculate distance to nearest neighbors
    
    return graph

def epsPlot(N, graph):
    N1 = graph.shape[0]
    CitiBike_distances = np.zeros((N1,1))
    for i in range(N1):
        CitiBike_distances[i] = graph[i,:].max()

    CitiBike_distances = np.sort(CitiBike_distances, axis=0)

    plt.figure(figsize=(15,5))
    #plt.subplot(1,2,1)
    plt.plot(range(N1), CitiBike_distances, 'r.', markersize=4) #plot the data
    plt.title('Dataset name: CitiBike_clus, sorted by neighbor distance')
    plt.xlabel('CitiBike_clus, Instance Number')
    plt.ylabel('CitiBike_clus, Distance to {0}th nearest neighbor'.format(N))
    #plt.xlim([400000,500000])
    #plt.plot([0, 350], [0.0054, 0.0054], 'k--', lw=0.5)
    #plt.grid()
    
def dbs(eps, minpts, data):
    #db = DBSCAN(eps=eps, min_samples=minpts).fit(CitiBike_dbscan)
    db = DBSCAN(eps=eps, min_samples=minpts, n_jobs=-1).fit(data)
    #db = DBSCAN(eps=eps, min_samples=minpts, metric='cosine', algorithm='brute').fit(CitiBike_clus)
    labels = db.labels_

    # Number of clusters in labels, ignoring noise if present.
    n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)

    # mark the samples that are considered "core"
    core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
    core_samples_mask[db.core_sample_indices_] = True

    plt.figure(figsize=(15,4))
    unique_labels = set(labels) # the unique labels
    colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
    
    for k, col in zip(unique_labels, colors):
        if k == -1:
            # Black used for noise.
            col = 'k'

        class_member_mask = (labels == k)
        
        if(data.equals(C_dbscan)):
            xy = CitiBike_C[class_member_mask & core_samples_mask]
        elif(data.equals(S_dbscan)):
            xy = CitiBike_S[class_member_mask & core_samples_mask]
        # plot the core points in this class
        #plt.plot(xy.start_station_longitude, xy.start_station_latitude, '.', markerfacecolor=col,
        #         markeredgecolor='w', markersize=6)
        jitter(xy.start_station_longitude, xy.start_station_latitude, c=col, s=10)

        # plot the remaining points that are edge points
        if(data.equals(C_dbscan)):
            xy = CitiBike_C[class_member_mask & ~core_samples_mask]
        elif(data.equals(S_dbscan)):
            xy = CitiBike_S[class_member_mask & ~core_samples_mask]
        #plt.plot(xy.start_station_longitude, xy.start_station_latitude, '.', markerfacecolor=col,
        #         markeredgecolor='w', markersize=3)
        jitter(xy.start_station_longitude, xy.start_station_latitude, c=col, s=10)

    plt.title('Estimated number of clusters: %d' % n_clusters_)
    plt.xlim(-10,10)
    plt.ylim(-3,2.5)
    plt.grid()
    plt.ticklabel_format(useOffset=False)
    plt.show()
    
    return(db)
Wall time: 0 ns
Customer DBSCAN

Before running the DBSCAN algorithm on our customer data set, we would first like to obtain a basic understanding of what eps and minimum number of samples values to use. Due to the multi-dimensionality of our data, and the fact that we have 250,000 observations in our customer data set, it is difficult to define an appropriate minimum number of samples through visual inspection of the data. Therefore, we have chosen a preliminary sample count of 200 from which we will identify an eps value to use as our initial starting point.

The eps plot below is configured such that customer transaction instances are ordered from least to greatest distance to their 200th nearest neighbor. To determine a good starting point, we identify a point where the distance starts to increase (looking for an inflection point upward). This is a bit challenging given the plotted results, but, again, this is only a starting point for our permutations. The arguable point identified in this case is at an eps value of about 0.5. This will be our starting value used in our first customer DBSCAN permutation.
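The "inflection point" reading can also be roughed out programmatically. The sketch below flags the first point-to-point increase that is far larger than the typical step in a sorted k-distance curve; the data here is fabricated to mimic the plot's shape, not our actual k-neighbors graph:

```python
import numpy as np

# Synthetic sorted k-distance curve: mostly flat, then a sharp rise,
# mimicking the shape read off the eps plot (fabricated data only).
d = np.concatenate([np.linspace(0.1, 0.5, 900),    # dense region
                    np.linspace(0.5, 12.0, 100)])  # sparse tail

steps = np.diff(d)
# Crude knee estimate: first instance whose step up is much larger
# than the typical (median) step between consecutive sorted distances.
knee_idx = int(np.argmax(steps > 5 * np.median(steps)))
eps_guess = float(d[knee_idx])
print(knee_idx, round(eps_guess, 2))
```

On this synthetic curve the heuristic lands where the tail begins, at a distance of about 0.5, which matches the value we read off the plot by eye.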

In [38]:
%%time
N = 200
CitiBike_knn_graph = getGraph(N, C_dbscan)
epsPlot(N, CitiBike_knn_graph)
plt.annotate('Expected Eps value = approx. 0.5', xy=(225000, 0.7), xytext=(150000, 10),
                arrowprops=dict(facecolor='black', shrink=0.05),)
plt.show()
Wall time: 25min 33s

As mentioned previously, a minimal core sample count of 200 was selected rather arbitrarily since it is very difficult to identify a good starting point visually. Given that our customer data contains 250,000 transactions and we'd like to obtain DBSCAN results producing no more than ten to twenty clusters at most, 200 seemed like a decent starting point when paired with an appropriate eps value. After a few DBSCAN runs, we decided to adjust this number based on the results produced when adjusting the eps value.

DBSCAN's inefficiencies with large data sets were quickly exemplified as we performed the following permutations:

  • Configuration 1: eps = 0.5, minpts = 200 ...... Wall time = 31min 16s
  • Configuration 2: eps = 0.7, minpts = 200 ...... Wall time = 35min 22s
  • Configuration 3: eps = 0.9, minpts = 200 ...... Wall time = 42min 47s
  • Configuration 4: eps = 1.0, minpts = 200 ...... Wall time = 45min 26s
  • Configuration 5: eps = 1.0, minpts = 250 ...... Wall time = 46min 55s
  • Configuration 6: eps = 1.0, minpts = 300 ...... Wall time = 47min 47s
  • Configuration 7: eps = 1.2, minpts = 200 ...... Wall time = 1h 14min 53s
  • Configuration 8: eps = 2.0, minpts = 200 ...... Wall time = 3h 39min 30s

These permutations are performed below and their outputs pickled in order to save time each time our analysis code is run through again. The concept of pickling our results will be discussed further in the Deployment section.

Cluster visualizations will be provided and discussed in the Modeling and Evaluation Part 3 section later in this writeup and, therefore, are not provided immediately below.
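The pickleObject and unpickleObject helpers called below are defined earlier in the notebook; for reference, equivalent caching helpers might look like the following sketch (the snake_case names here are illustrative stand-ins, not our exact implementation):

```python
import os
import pickle

def pickle_object(obj, name, filepath="PickleFiles/"):
    """Serialize a fitted model so expensive runs are not repeated."""
    os.makedirs(filepath, exist_ok=True)
    with open(os.path.join(filepath, name + ".pkl"), "wb") as f:
        pickle.dump(obj, f)

def unpickle_object(name, filepath="PickleFiles/"):
    """Load a previously pickled model back from disk."""
    with open(os.path.join(filepath, name + ".pkl"), "rb") as f:
        return pickle.load(f)
```

Checking os.path.isfile before each run, as the cells below do, then turns a multi-hour DBSCAN fit into a sub-second disk read on repeat executions.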

In [39]:
%%time
if os.path.isfile("PickleFiles/db_c.pkl"):
    print("Found the File!")
    db_c = unpickleObject("db_c")
else: db_c = dbs(0.5, 200, C_dbscan)
Found the File!
Wall time: 359 ms
In [40]:
%%time
if os.path.isfile("PickleFiles/db_c1.pkl"):
    print("Found the File!")
    db_c1 = unpickleObject("db_c1")
else: db_c1 = dbs(0.7, 200, C_dbscan)
Found the File!
Wall time: 388 ms
In [41]:
%%time
if os.path.isfile("PickleFiles/db_c2.pkl"):
    print("Found the File!")
    db_c2 = unpickleObject("db_c2")
else: db_c2 = dbs(0.9, 200, C_dbscan)
Found the File!
Wall time: 447 ms
In [42]:
%%time
if os.path.isfile("PickleFiles/db_c3.pkl"):
    print("Found the File!")
    db_c3 = unpickleObject("db_c3")
else: db_c3 = dbs(1.0, 200, C_dbscan)
Found the File!
Wall time: 510 ms
In [43]:
%%time
if os.path.isfile("PickleFiles/db_c4.pkl"):
    print("Found the File!")
    db_c4 = unpickleObject("db_c4")
else: db_c4 = dbs(1.0, 250, C_dbscan)
Found the File!
Wall time: 419 ms
In [44]:
%%time
if os.path.isfile("PickleFiles/db_c5.pkl"):
    print("Found the File!")
    db_c5 = unpickleObject("db_c5")
else: db_c5 = dbs(1.0, 300, C_dbscan)
Found the File!
Wall time: 427 ms
In [45]:
%%time
if os.path.isfile("PickleFiles/db_c6.pkl"):
    print("Found the File!")
    db_c6 = unpickleObject("db_c6")
else: db_c6 = dbs(1.2, 200, C_dbscan)
Found the File!
Wall time: 469 ms
In [46]:
%%time
if os.path.isfile("PickleFiles/db_c7.pkl"):
    print("Found the File!")
    db_c7 = unpickleObject("db_c7")
else: db_c7 = dbs(2.0, 200, C_dbscan)
Found the File!
Wall time: 488 ms
In [47]:
%%time
if os.path.isfile("PickleFiles/db_c.pkl"):
    print("Found the File!")
else: pickleObject(db_c, "db_c", filepath = "PickleFiles/")

if os.path.isfile("PickleFiles/db_c1.pkl"):
    print("Found the File!")
else: pickleObject(db_c1, "db_c1", filepath = "PickleFiles/")

if os.path.isfile("PickleFiles/db_c2.pkl"):
    print("Found the File!")
else: pickleObject(db_c2, "db_c2", filepath = "PickleFiles/")

if os.path.isfile("PickleFiles/db_c3.pkl"):
    print("Found the File!")
else: pickleObject(db_c3, "db_c3", filepath = "PickleFiles/")

if os.path.isfile("PickleFiles/db_c4.pkl"):
    print("Found the File!")
else: pickleObject(db_c4, "db_c4", filepath = "PickleFiles/")

if os.path.isfile("PickleFiles/db_c5.pkl"):
    print("Found the File!")
else: pickleObject(db_c5, "db_c5", filepath = "PickleFiles/")

if os.path.isfile("PickleFiles/db_c6.pkl"):
    print("Found the File!")
else: pickleObject(db_c6, "db_c6", filepath = "PickleFiles/")

if os.path.isfile("PickleFiles/db_c7.pkl"):
    print("Found the File!")
else: pickleObject(db_c7, "db_c7", filepath = "PickleFiles/")
Found the File!
Found the File!
Found the File!
Found the File!
Found the File!
Found the File!
Found the File!
Found the File!
Wall time: 2.01 ms
Subscriber DBSCAN

Similar to what was done for our customer data above, we again chose a minimum core sample count of 200 as a preliminary starting point for subscriber data, not knowing exactly how many samples to define given our 22-dimensional data set. We fully expect the results of the following eps plot to be different from the customer data's plot because, as has been clearly identified in our previous labs and visually earlier in this lab, customers and subscribers behave differently. Therefore, we expect them to cluster differently as well.

The subscriber eps plot below is configured as was the plot for customer data above. The arguable point identified in this case is at an eps value of about 1.0. This will be our starting value used in our first subscriber DBSCAN permutation.

In [48]:
%%time
N = 200
CitiBike_knn_graph = getGraph(N, S_dbscan)
epsPlot(N, CitiBike_knn_graph)
plt.annotate('Expected Eps value = approx. 1.0', xy=(225000, 1.0), xytext=(150000, 5.5),
                arrowprops=dict(facecolor='black', shrink=0.05),)
plt.show()
Wall time: 22min 21s

With our starting points identified, we again planned to conduct 8 different permutations to narrow down to the best subscriber DBSCAN configuration. These configurations' parameters are outlined below along with their final run times (Note that subscriber permutations were run on a different workstation than the customer permutations to make efficient use of work time. So, these times are not meant to be an apples-to-apples comparison against customer times.):

  • Configuration 1: eps = 1.0, minpts = 200 ...... Wall time = 31min 48s
  • Configuration 2: eps = 0.8, minpts = 200 ...... Wall time = 25min 39s
  • Configuration 3: eps = 0.5, minpts = 200 ...... Wall time = 18min 13s
  • Configuration 4: eps = 0.5, minpts = 150 ...... Wall time = 18min 16s
  • Configuration 5: eps = 0.5, minpts = 250 ...... Wall time = 18min 21s
  • Configuration 6: eps = 0.5, minpts = 300 ...... Wall time = 18min 11s
  • Configuration 7: eps = 0.5, minpts = 350 ...... Wall time = 18min 17s
  • Configuration 8: eps = 0.5, minpts = 450 ...... Wall time = 18min 6s

After reviewing results for these eight permutations, we decided it was worth also attempting two more permutations as outlined below:

  • Configuration 9: eps = 1.0, minpts = 450 ...... Wall time = 30min 31s
  • Configuration 10: eps = 1.2, minpts = 450 ...... Wall time = 35min 18s

Again, these permutations are performed below but their outputs are pickled in order to save time with each run of our code. Cluster visualizations will be provided and discussed in the Modeling and Evaluation Part 3 section later in this writeup and, therefore, are not provided immediately below.

In [49]:
%%time
if os.path.isfile("PickleFiles/db_s.pkl"):
    print("Found the File!")
    db_s = unpickleObject("db_s")
else: db_s = dbs(1.0, 200, S_dbscan)
Found the File!
Wall time: 588 ms
In [50]:
%%time
if os.path.isfile("PickleFiles/db_s1.pkl"):
    print("Found the File!")
    db_s1 = unpickleObject("db_s1")
else: db_s1 = dbs(0.8, 200, S_dbscan)
Found the File!
Wall time: 410 ms
In [51]:
%%time
if os.path.isfile("PickleFiles/db_s2.pkl"):
    print("Found the File!")
    db_s2 = unpickleObject("db_s2")
else: db_s2 = dbs(0.5, 200, S_dbscan)
Found the File!
Wall time: 309 ms
In [52]:
%%time
if os.path.isfile("PickleFiles/db_s3.pkl"):
    print("Found the File!")
    db_s3 = unpickleObject("db_s3")
else: db_s3 = dbs(0.5, 150, S_dbscan)
Found the File!
Wall time: 388 ms
In [53]:
%%time
if os.path.isfile("PickleFiles/db_s4.pkl"):
    print("Found the File!")
    db_s4 = unpickleObject("db_s4")
else: db_s4 = dbs(0.5, 250, S_dbscan)
Found the File!
Wall time: 312 ms
In [54]:
%%time
if os.path.isfile("PickleFiles/db_s5.pkl"):
    print("Found the File!")
    db_s5 = unpickleObject("db_s5")
else: db_s5 = dbs(0.5, 300, S_dbscan)
Found the File!
Wall time: 225 ms
In [55]:
%%time
if os.path.isfile("PickleFiles/db_s6.pkl"):
    print("Found the File!")
    db_s6 = unpickleObject("db_s6")
else: db_s6 = dbs(0.5, 350, S_dbscan)
Found the File!
Wall time: 234 ms
In [56]:
%%time
if os.path.isfile("PickleFiles/db_s7.pkl"):
    print("Found the File!")
    db_s7 = unpickleObject("db_s7")
else: db_s7 = dbs(0.5, 450, S_dbscan)
Found the File!
Wall time: 227 ms
In [57]:
%%time
if os.path.isfile("PickleFiles/db_s8.pkl"):
    print("Found the File!")
    db_s8 = unpickleObject("db_s8")
else: db_s8 = dbs(1, 450, S_dbscan) ## This is our WINNER
Found the File!
Wall time: 370 ms
In [58]:
%%time
if os.path.isfile("PickleFiles/db_s9.pkl"):
    print("Found the File!")
    db_s9 = unpickleObject("db_s9")
else: db_s9 = dbs(1.2, 450, S_dbscan)
Found the File!
Wall time: 416 ms
In [59]:
%%time
if os.path.isfile("PickleFiles/db_s.pkl"):
    print("Found the File!")
else: pickleObject(db_s, "db_s", filepath = "PickleFiles/")

if os.path.isfile("PickleFiles/db_s1.pkl"):
    print("Found the File!")
else: pickleObject(db_s1, "db_s1", filepath = "PickleFiles/")

if os.path.isfile("PickleFiles/db_s2.pkl"):
    print("Found the File!")
else: pickleObject(db_s2, "db_s2", filepath = "PickleFiles/")

if os.path.isfile("PickleFiles/db_s3.pkl"):
    print("Found the File!")
else: pickleObject(db_s3, "db_s3", filepath = "PickleFiles/")

if os.path.isfile("PickleFiles/db_s4.pkl"):
    print("Found the File!")
else: pickleObject(db_s4, "db_s4", filepath = "PickleFiles/")

if os.path.isfile("PickleFiles/db_s5.pkl"):
    print("Found the File!")
else: pickleObject(db_s5, "db_s5", filepath = "PickleFiles/")

if os.path.isfile("PickleFiles/db_s6.pkl"):
    print("Found the File!")
else: pickleObject(db_s6, "db_s6", filepath = "PickleFiles/")

if os.path.isfile("PickleFiles/db_s7.pkl"):
    print("Found the File!")
else: pickleObject(db_s7, "db_s7", filepath = "PickleFiles/")

if os.path.isfile("PickleFiles/db_s8.pkl"):
    print("Found the File!")
else: pickleObject(db_s8, "db_s8", filepath = "PickleFiles/")

if os.path.isfile("PickleFiles/db_s9.pkl"):
    print("Found the File!")
else: pickleObject(db_s9, "db_s9", filepath = "PickleFiles/")
Found the File!
Found the File!
Found the File!
Found the File!
Found the File!
Found the File!
Found the File!
Found the File!
Found the File!
Found the File!
Wall time: 4.01 ms

Spectral Clustering

The final clustering method we will apply to our data set is Spectral Clustering. Due to the multi-dimensional state of our data, our expectation is that spectral clustering will outperform K-means and DBSCAN clustering as it does not rely on distance measures directly but rather on node connectivity. Another expected advantage of using Spectral Clustering is that it allows us to reduce the dimensionality of our data by clustering only the data's first few Laplacian matrix eigenvectors instead of all 22 attributes. As such, it does require a bit more data prep before clustering, such as computing Similarity, Affinity, Degree, and Laplacian matrices, but the anticipated outcome is worth the effort. It is by calculating these Laplacian matrix eigenvectors that we are also able to identify an appropriate number of clusters to feed into the algorithm. Cluster count identification will be revealed and discussed in the Modeling and Evaluation Part 2 - Evaluate and Compare section.

For Spectral Clustering we chose to use R not only for a change of pace but also because documentation for running the algorithm in R is plentiful and application using R overlapped with one of our group members' projects at work. A similar outcome could have been reached with Python just as well, but integrating R offered a fun, new challenge on our Windows OS machines.

Before computing our required matrices, we must first initialize our rpy2.ipython session and pass to it our scaled Customer and Subscriber data. We also check to ensure the data imported correctly as an R dataframe and that attribute data types are as expected.

In [60]:
%load_ext rpy2.ipython
#%R install.packages("kernlab")
%R library(kernlab)
clear_display()
In [61]:
%%time

%R -i CitiBike_C
%R -i CitiBike_S
Wall time: 24.5 s
In [62]:
%%time
out = %R capture.output(str(CitiBike_C))

for line in out:
         print(line)
print("")
out = %R capture.output(str(CitiBike_S))

for line in out:
         print(line)
'data.frame':	250000 obs. of  23 variables:
 $ start_station_latitude : num  -1.946 0.427 -1.217 1.443 0.383 ...
 $ start_station_longitude: num  1.119 -0.453 -1.997 1.104 0.639 ...
 $ end_station_latitude   : num  -0.867 0.585 -1.052 1.27 0.392 ...
 $ end_station_longitude  : num  2.397 -0.809 -1.867 1.338 0.633 ...
 $ PRCP                   : num  -0.30592 -0.30592 -0.30592 4.08782 0.00792 ...
 $ SNOW                   : num  -0.0696 -0.0696 -0.0696 -0.0696 -0.0696 ...
 $ TAVE                   : num  1.285 -0.464 0.961 0.637 0.508 ...
 $ TMAX                   : num  1.259 -0.632 0.755 0.313 0.376 ...
 $ TMIN                   : num  1.319 -0.316 1.123 0.927 0.665 ...
 $ DayOfWeek_Friday       : num  -0.399 -0.399 -0.399 -0.399 -0.399 ...
 $ DayOfWeek_Monday       : num  -0.385 -0.385 -0.385 2.595 -0.385 ...
 $ DayOfWeek_Saturday     : num  -0.472 -0.472 -0.472 -0.472 2.118 ...
 $ DayOfWeek_Sunday       : num  2.268 -0.441 2.268 -0.441 -0.441 ...
 $ DayOfWeek_Thursday     : num  -0.382 -0.382 -0.382 -0.382 -0.382 ...
 $ DayOfWeek_Tuesday      : num  -0.386 -0.386 -0.386 -0.386 -0.386 ...
 $ DayOfWeek_Wednesday    : num  -0.389 2.573 -0.389 -0.389 -0.389 ...
 $ TimeOfDay_Afternoon    : num  1.845 -0.542 -0.542 -0.542 -0.542 ...
 $ TimeOfDay_Evening      : num  -0.694 1.44 1.44 1.44 -0.694 ...
 $ TimeOfDay_Midday       : num  -0.544 -0.544 -0.544 -0.544 1.838 ...
 $ TimeOfDay_Morning      : num  -0.407 -0.407 -0.407 -0.407 -0.407 ...
 $ TimeOfDay_Night        : num  -0.288 -0.288 -0.288 -0.288 -0.288 ...
 $ tripdurationLog        : num  1.187 -0.967 -0.315 -1.16 -1.902 ...
 $ usertype               : Factor w/ 1 level "Customer": 1 1 1 1 1 1 1 1 1 1 ...

'data.frame':	250000 obs. of  23 variables:
 $ start_station_latitude : num  0.0861 0.8863 -0.2608 1.1406 0.143 ...
 $ start_station_longitude: num  -0.5107 1.067 0.0947 1.504 1.0364 ...
 $ end_station_latitude   : num  0.273 0.273 0.101 -0.147 0.333 ...
 $ end_station_longitude  : num  -0.8597 -0.8597 -0.2377 0.0523 0.5648 ...
 $ PRCP                   : num  -0.149 -0.306 -0.306 -0.306 -0.306 ...
 $ SNOW                   : num  -0.0696 -0.0696 -0.0696 -0.0696 -0.0696 ...
 $ TAVE                   : num  -0.0107 -0.3347 -2.473 0.4429 -0.2051 ...
 $ TMAX                   : num  0.0614 -0.4427 -2.6481 0.2504 -0.0016 ...
 $ TMIN                   : num  -0.12 -0.251 -2.279 0.665 -0.447 ...
 $ DayOfWeek_Friday       : num  -0.399 -0.399 -0.399 -0.399 -0.399 ...
 $ DayOfWeek_Monday       : num  2.595 -0.385 -0.385 -0.385 -0.385 ...
 $ DayOfWeek_Saturday     : num  -0.472 -0.472 -0.472 -0.472 -0.472 ...
 $ DayOfWeek_Sunday       : num  -0.441 -0.441 -0.441 2.268 -0.441 ...
 $ DayOfWeek_Thursday     : num  -0.382 2.619 2.619 -0.382 -0.382 ...
 $ DayOfWeek_Tuesday      : num  -0.386 -0.386 -0.386 -0.386 -0.386 ...
 $ DayOfWeek_Wednesday    : num  -0.389 -0.389 -0.389 -0.389 2.573 ...
 $ TimeOfDay_Afternoon    : num  -0.542 -0.542 -0.542 -0.542 -0.542 ...
 $ TimeOfDay_Evening      : num  -0.694 1.44 -0.694 -0.694 -0.694 ...
 $ TimeOfDay_Midday       : num  -0.544 -0.544 -0.544 1.838 -0.544 ...
 $ TimeOfDay_Morning      : num  -0.407 -0.407 2.454 -0.407 -0.407 ...
 $ TimeOfDay_Night        : num  3.467 -0.288 -0.288 -0.288 3.467 ...
 $ tripdurationLog        : num  -1.51 0.12 -1.35 -1.64 -1.35 ...
 $ usertype               : Factor w/ 1 level "Subscriber": 1 1 1 1 1 1 1 1 1 1 ...
Wall time: 150 ms

After initial attempts to compute similarity matrices for all 250,000 transactions from each user type data set, it became quickly apparent we would need to sub-sample our data further. We discovered that the relationship between sample size and CPU time required to compute the Similarity, Affinity, Degree, and Laplacian matrices is not linear. For example, when random-sampling 500 transactions from each data set, matrix computations took 11 minutes to complete for each data set (22 minutes in total between both data sets). When 1000 samples were randomly selected for matrix derivation, however, the required CPU runtime increased significantly to 44 minutes per data set (1.5 hours in total between both data sets). Given the significant amount of time spent running our DBSCAN code above, we opted to sample no more than 1000 transactions based on our project schedule. We later address this loss in granularity by predicting the remaining transactions' cluster IDs via classification methods in our Deployment section.
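The observed jump from 11 to 44 minutes when doubling the sample size is consistent with quadratic growth: the similarity matrix needs one kernel evaluation per ordered pair of sampled rows. A quick sanity check on that arithmetic:

```python
# One kernel evaluation is needed per ordered pair of sampled rows,
# so doubling the sample size quadruples the work. This matches the
# observed wall times: 11 min at n=500 versus 44 min at n=1000.
evals_500 = 500 ** 2
evals_1000 = 1000 ** 2
ratio = evals_1000 / evals_500

print(ratio)
```

The 4x work ratio lines up exactly with the 44/11 ratio of observed run times, which is why pushing much beyond 1000 samples was impractical on our schedule.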

In [63]:
%%time
%%R
set.seed(100)
CitiBike_miniC <- CitiBike_C[sample(1:nrow(CitiBike_C), 1000, replace=FALSE),]
CitiBike_miniS <- CitiBike_S[sample(1:nrow(CitiBike_S), 1000, replace=FALSE),]
Wall time: 105 ms

In final preparation for matrix computation, we configure our Similarity and Affinity matrix functions which are a bit more extensive than our Degree and Laplacian matrix computations. These functions were adapted from a tutorial written by Joao Neto in December 2013. Similarly, we based our eigenvector derivations on this same tutorial.

In [64]:
%%time
%%R
# Similarity matrix computation functions
s <- function(x1, x2, alpha=1) {
  exp(- alpha * norm(as.matrix(x1-x2), type="F"))
}

make.similarity <- function(my.data, similarity) {
  N <- nrow(my.data)
  S <- matrix(rep(NA,N^2), ncol=N)
  for(i in 1:N) {
    for(j in 1:N) {
      S[i,j] <- similarity(my.data[i,], my.data[j,])
    }
  }
  S
}

# Affinity matrix computation functions
make.affinity <- function(S, n.neighboors=2) {
  N <- length(S[,1])

  if (n.neighboors >= N) {  # fully connected
    A <- S
  } else {
    A <- matrix(rep(0,N^2), ncol=N)
    for(i in 1:N) { # for each line
      # only connect to those points with larger similarity 
      best.similarities <- sort(S[i,], decreasing=TRUE)[1:n.neighboors]
      for (s in best.similarities) {
        j <- which(S[i,] == s)
        A[i,j] <- S[i,j]
        A[j,i] <- S[i,j] # to make an undirected graph, ie, the matrix becomes symmetric
      }
    }
  }
  A  
}
Wall time: 127 ms
Customer Spectral Clustering

We first tackle the Customer data set by calculating our Similarity, Affinity, Degree, and Laplacian matrices. Due to matrix computation CPU requirements (42min 43s), we save our Laplacian matrix object as an RDS file for readback, similar to our previous pickles.

The Similarity and Affinity matrices serve a similar purpose to an Adjacency matrix by indicating node proximity, with edge weights indicating link strength. Once the Affinity matrix is produced, it is used to compute the Degree matrix, a diagonal matrix holding the summed edge weights for each node. Next, the Laplacian matrix is computed from both the Degree and Affinity matrices. We opted to implement the unnormalized graph Laplacian for our application as our data is already scaled. The formula for this is simply: $Laplacian = Degree - Affinity$
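The Degree and unnormalized Laplacian steps can be sketched in a few lines of Python on a toy affinity matrix (the values below are made up for illustration; our actual matrices are computed in R):

```python
import numpy as np

# Toy symmetric affinity matrix for 4 points (illustrative values only).
A = np.array([[0.0, 0.9, 0.1, 0.0],
              [0.9, 0.0, 0.2, 0.0],
              [0.1, 0.2, 0.0, 0.8],
              [0.0, 0.0, 0.8, 0.0]])

D = np.diag(A.sum(axis=1))  # degree matrix: row sums on the diagonal
L = D - A                   # unnormalized graph Laplacian

# Every row of L sums to zero, a quick sanity check on the construction.
print(L.sum(axis=1))
```

The zero row sums follow directly from the definition, since each diagonal degree entry is exactly the sum of the off-diagonal affinities it cancels.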

In [65]:
%%time
%%R
set.seed(100)
if(!file.exists('PickleFiles/CustLap.rds')){
    S <- make.similarity(CitiBike_miniC[,1:22], s)
    A <- make.affinity(S, 3)  # use 3 neighboors (includes self)
    D <- diag(apply(A, 1, sum)) # sum rows
    U <- D - A
    saveRDS(U, 'PickleFiles/CustLap.rds')
}
else{U <- readRDS('PickleFiles/CustLap.rds')}
Wall time: 151 ms

After the final Laplacian matrix is complete, we move on to eigenvector computation. Specifically, we are interested in deriving the first eigenvector and the second eigenvector, or Fiedler Vector. These are required for us to cluster our data in a reduced dimensional space by identifying the location of a normalized cut in the Fiedler Vector. This cut location is revealed by analyzing both the number of edges in each cluster that do not connect to other clusters and the number of edges which do connect each cluster. Similarly, relative connectivity and interconnectivity may be used to evaluate a proper cut: $RC*RI^a$
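As a minimal illustration of the Fiedler Vector's role, the sketch below computes it for a toy path graph and cuts the graph where its sign changes (the graph is fabricated for illustration; our actual computation operates on the 1000-sample Laplacian in R):

```python
import numpy as np

# Path graph 0-1-2-3: the Fiedler vector's sign change marks the
# natural place to cut the graph into two balanced pieces.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A

vals, vecs = np.linalg.eigh(L)   # eigenvalues in ascending order
fiedler = vecs[:, 1]             # vector of the second-smallest eigenvalue

cut = fiedler > 0                # sign-based two-way partition
print(cut)
```

For this path graph the signs split nodes {0, 1} from {2, 3}, the cut that severs the fewest edges, which is exactly the behavior exploited in higher dimensions.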

In [66]:
%%time
%%R
set.seed(100)
k   <- 2
evL <- eigen(U, symmetric=TRUE)
Z   <- evL$vectors[,(ncol(evL$vectors)-k+1):ncol(evL$vectors)]
evLc <- evL
Wall time: 1.85 s

The above matrix and eigenvalue computations are the raw, manual methods used to transform the data set into a reduced dimensional space which can be easily clustered. Doing so manually is essential to identifying the correct number of clusters to feed into our clustering algorithm. This cluster count identification will be discussed in detail in the Modeling and Evaluation Part 2 - Evaluate and Compare section, but suffice it to say for now that the appropriate number of clusters for customer data is 3.

Fortunately, beyond our manual eigenvector computation for cluster count identification, the specc function from the kernlab R library can perform these computations for us and then cluster on the produced eigenvectors. All that is required of us beforehand is to convert our 1000-sample dataframe into a matrix.

In [67]:
%%time
%%R
set.seed(100)
#%R mCiti <- data.matrix(CitiBike_mini[,5:22])
mCitiC <- data.matrix(CitiBike_miniC)
colnames(mCitiC) <- colnames(CitiBike_miniC)
Wall time: 118 ms

The specc spectral clustering function provides a number of parameters we can adjust to obtain varying cluster results. Permutations are provided below and run fast enough that no further RDS object storage is required.

  • Configuration 1: centers = 3, iterations = 200, kernel = 'rbfdot', nystrom.red = FALSE
  • Configuration 2: centers = 3, iterations = 500, kernel = 'rbfdot', nystrom.red = FALSE
  • Configuration 3: centers = 3, iterations = 1000, kernel = 'rbfdot', nystrom.red = FALSE
  • Configuration 4: centers = 3, iterations = 1000, kernel = 'laplacedot', nystrom.red = FALSE
  • Configuration 5: centers = 3, iterations = 1000, kernel = 'polydot', nystrom.red = FALSE
  • Configuration 6: centers = 3, iterations = 1000, kernel = 'rbfdot', nystrom.red = TRUE

Centers defines the number of clusters to produce as one would expect. Iterations in this case defines the maximum number of iterations allowed for the clustering algorithm to perform; the default number of iterations is 200. The kernel argument defines the kernel function used to compute the affinity matrix within the specc call; the default is 'rbfdot' which is the Radial Basis, or Gaussian, kernel function. Other options provided in our configurations include 'laplacedot' and 'polydot', where 'laplacedot' is the Laplacian kernel function and 'polydot' is the Polynomial kernel function. Finally, nystrom.red is a boolean argument signifying whether to use Nystrom's Method to calculate eigenvectors. The default is False, but when changed to True, a sample of the dataset is used to compute the eigenvalues so that only an $n*m$ matrix ($n$ being the sample size) is stored in memory. We use the default number of Nystrom's Method samples which is $\frac{nrow}{6}$, or 167 given our data.
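For readers working in Python, scikit-learn's SpectralClustering offers a roughly comparable interface to specc. The sketch below is an analogue on synthetic data, not the code we ran; the blob centers, affinity choice, and seed are illustrative assumptions only:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Three well-separated synthetic blobs stand in for our scaled sample
# (fabricated data; centers and sizes chosen only for illustration).
rng = np.random.RandomState(100)
X = np.vstack([rng.randn(50, 2) + center
               for center in ([0, 0], [10, 0], [0, 10])])

# Loosely analogous to specc(mCitiC, centers = 3): build an affinity
# graph, embed via Laplacian eigenvectors, then k-means the embedding.
sc = SpectralClustering(n_clusters=3, affinity='nearest_neighbors',
                        random_state=100)
labels = sc.fit_predict(X)
print(sorted(set(labels)))
```

As with specc, the heavy lifting (affinity construction, eigen-decomposition, final k-means step) is hidden behind a single fit call, with the kernel/affinity choice exposed as a parameter.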

In [68]:
%%time
%%R
set.seed(100)
sc_c <- specc(mCitiC, centers = 3)
Wall time: 29.6 s
In [69]:
%%time
%%R
set.seed(100)
sc_c1 <- specc(mCitiC, centers = 3, iterations = 500)
Wall time: 29.3 s
In [70]:
%%time
%%R
set.seed(100)
sc_c2 <- specc(mCitiC, centers = 3, iterations = 1000)
Wall time: 28.2 s
In [71]:
%%time
%%R
set.seed(100)
sc_c3 <- specc(mCitiC, centers = 3, iterations = 1000, kernel = 'laplacedot')
Wall time: 28.7 s
In [72]:
%%time
%%R
set.seed(100)
sc_c4 <- specc(mCitiC, centers = 3, iterations = 1000, kernel = 'polydot')
Wall time: 27.6 s
In [73]:
%%time
%%R
set.seed(100)
sc_c5 <- specc(mCitiC, centers = 3, iterations = 1000, kernel = 'rbfdot', nystrom.red = TRUE)
Wall time: 900 ms
Subscriber Spectral Clustering

We next tackle the Subscriber data set by calculating our Similarity, Affinity, Degree, and Laplacian matrices. Due to matrix computation CPU requirements (48min 29s), we save our Laplacian matrix object as an RDS file for readback, as was done with Customer spectral clustering.

In [74]:
%%time
%%R
set.seed(100)
if(!file.exists('PickleFiles/SubsLap.rds')){
    S <- make.similarity(CitiBike_miniS[,1:22], s)
    A <- make.affinity(S, 3)  # use 3 neighboors (includes self)
    D <- diag(apply(A, 1, sum)) # sum rows
    U <- D - A
    saveRDS(U, 'PickleFiles/SubsLap.rds')
}
else{U <- readRDS('PickleFiles/SubsLap.rds')}
Wall time: 75.7 ms

After the final Laplacian matrix is complete, we move on to eigenvector computation to obtain our first eigenvector and Fiedler Vector.

In [75]:
%%time
%%R
set.seed(100)
k   <- 2
evL <- eigen(U, symmetric=TRUE)
Z   <- evL$vectors[,(ncol(evL$vectors)-k+1):ncol(evL$vectors)]
evLs <- evL

signif(evL$values,2) # eigenvalues are in decreasing order
Wall time: 2.09 s

The above matrix and eigenvalue computations are the raw, manual methods used to transform the data set into a reduced dimensional space which can be easily clustered. Doing so manually is essential to identifying the correct number of clusters to feed into our clustering algorithm. This cluster count identification will be discussed in detail in the Modeling and Evaluation Part 2 - Evaluate and Compare section, but suffice it to say for now that the appropriate number of clusters to start with for subscriber data is 4. Five clusters were also identified as a possible candidate and will be explored here as well.

Again, the specc function will re-perform the matrix computations for us and then cluster on the produced eigenvectors. All that is required of us beforehand is to convert our 1000-sample dataframe into a matrix.

In [76]:
%%time
%%R
set.seed(100)
#%R mCiti <- data.matrix(CitiBike_mini[,5:22])
mCitiS <- data.matrix(CitiBike_miniS)
colnames(mCitiS) <- colnames(CitiBike_miniS)
Wall time: 78.5 ms

Subscriber data permutations are provided below and run fast enough that no further RDS object storage is required. Notice that 4 centers were first defined, but since the cluster sizes were not well distributed, 5 clusters were defined instead (we had some flexibility to use either 4 or 5 centers based on cluster count analysis, as will be described later).

  • Configuration 1: centers = 4, iterations = 200, kernel = 'rbfdot', nystrom.red = FALSE
  • Configuration 2: centers = 5, iterations = 200, kernel = 'rbfdot', nystrom.red = FALSE
  • Configuration 3: centers = 5, iterations = 500, kernel = 'rbfdot', nystrom.red = FALSE
  • Configuration 4: centers = 5, iterations = 1000, kernel = 'rbfdot', nystrom.red = FALSE
  • Configuration 5: centers = 5, iterations = 1000, kernel = 'laplacedot', nystrom.red = FALSE
  • Configuration 6: centers = 5, iterations = 1000, kernel = 'polydot', nystrom.red = FALSE
  • Configuration 7: centers = 5, iterations = 1000, kernel = 'rbfdot', nystrom.red = TRUE

Centers, as one would expect, defines the number of clusters to produce. Iterations defines the maximum number of iterations the clustering algorithm may perform; the default is 200. The kernel argument defines the kernel function used to compute the affinity matrix within the specc call; the default, 'rbfdot', is the Radial Basis (Gaussian) kernel function. Other options in our configurations include 'laplacedot', the Laplacian kernel function, and 'polydot', the Polynomial kernel function. Finally, nystrom.red is a boolean argument signifying whether to use the Nystrom Method to calculate the eigenvectors. The default is FALSE; when set to TRUE, a sample of the dataset is used to compute the eigenvalues so that only an $n \times m$ matrix ($n$ being the sample size) is stored in memory. We use the default number of Nystrom Method samples, $\frac{nrow}{6}$, or 167 given our data.
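For readers more at home in Python, the configurations above can be approximated with scikit-learn's SpectralClustering, whose 'rbf' affinity corresponds roughly to kernlab's 'rbfdot'. This is a sketch on synthetic stand-in data (the array X below is not our CitiBike sample), not a reproduction of the R workflow:

```python
# Hedged Python analogue of a specc call such as specc(mCitiS, centers = 5).
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.RandomState(1000)
X = rng.rand(200, 5)  # synthetic stand-in for the sampled mCitiS matrix

# affinity='rbf' builds the Gaussian affinity matrix internally,
# much like kernel = 'rbfdot' in specc.
sc = SpectralClustering(n_clusters=5, affinity='rbf', random_state=1000)
labels = sc.fit_predict(X)
print(np.bincount(labels))  # cluster sizes, analogous to size() in kernlab
```

Note that scikit-learn exposes no direct equivalent of specc's `iterations` or `nystrom.red` arguments; those remain specific to the kernlab implementation.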

In [77]:
%%time
%%R
set.seed(1000)
sc_s <- specc(mCitiS, centers = 4)
Wall time: 29.4 s
In [78]:
%%time
%%R
set.seed(1000)
sc_s6 <- specc(mCitiS, centers = 5)
Wall time: 28.9 s
In [79]:
%%time
%%R
set.seed(1000)
sc_s1 <- specc(mCitiS, centers = 5, iterations = 500)
Wall time: 29.4 s
In [80]:
%%time
%%R
set.seed(1000)
sc_s2 <- specc(mCitiS, centers = 5, iterations = 1000)
Wall time: 29.7 s
In [81]:
%%time
%%R
set.seed(1000)
sc_s3 <- specc(mCitiS, centers = 5, iterations = 1000, kernel = 'laplacedot')
Wall time: 28.7 s
In [82]:
%%time
%%R
set.seed(1000)
sc_s4 <- specc(mCitiS, centers = 5, iterations = 1000, kernel = 'polydot')
Wall time: 30.6 s
In [83]:
%%time
%%R
set.seed(1000)
sc_s5 <- specc(mCitiS, centers = 5, iterations = 1000, kernel = 'rbfdot', nystrom.red = TRUE)
Wall time: 388 ms

Modeling and Evaluation Part 2 - Evaluate and Compare

K-Means Clustering

A significant weakness of k-means is that it can easily converge to local optima that do not represent the broader dataset. An example for each dataset, using inertia results for different numbers of initializations, is below:

In [84]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm

from sklearn.utils import shuffle
from sklearn.utils import check_random_state
from sklearn.cluster import MiniBatchKMeans
from sklearn.cluster import KMeans

kmeans_sub_y = CitiBike_S['usertype']
kmeans_cus_y = CitiBike_C['usertype']

# The 22 PCA-selected features used for clustering (shared by both sets)
feature_cols = ["start_station_latitude", "start_station_longitude",
                "end_station_latitude", "end_station_longitude",
                "PRCP", "SNOW", "TAVE", "TMAX", "TMIN",
                "DayOfWeek_Friday", "DayOfWeek_Monday", "DayOfWeek_Saturday",
                "DayOfWeek_Sunday", "DayOfWeek_Thursday", "DayOfWeek_Tuesday",
                "DayOfWeek_Wednesday", "TimeOfDay_Afternoon", "TimeOfDay_Evening",
                "TimeOfDay_Midday", "TimeOfDay_Morning", "TimeOfDay_Night",
                "tripdurationLog"]
kmeans_sub_X = CitiBike_S[feature_cols]
kmeans_cus_X = CitiBike_C[feature_cols]

n_runs = 5
n_init_range = np.array([1, 5, 10, 15, 20])
fig = plt.figure()
plots = []
legends = []

cases = [
    (KMeans, 'k-means++', {}),
    (KMeans, 'random', {})
]

for factory, init, params in cases:
    inertia = np.empty((len(n_init_range), n_runs))

    for run_id in range(n_runs):
        for i, n_init in enumerate(n_init_range):
            km = factory(n_clusters=11, init=init, random_state=run_id,
                         n_init=n_init, **params).fit(kmeans_sub_X)
            inertia[i, run_id] = km.inertia_
    p = plt.errorbar(n_init_range, inertia.mean(axis=1), inertia.std(axis=1))
    plots.append(p[0])
    legends.append("%s with %s init" % (factory.__name__, init))

plt.xlabel('n_init')
plt.ylabel('inertia')
plt.legend(plots, legends)
plt.title("Subscribers: Mean inertia for various k-means init across %d runs" % n_runs)

In [85]:
fig = plt.figure()  # start a fresh figure and reset legend handles from In [84]
plots = []
legends = []

for factory, init, params in cases:
    inertia = np.empty((len(n_init_range), n_runs))

    for run_id in range(n_runs):
        for i, n_init in enumerate(n_init_range):
            km = factory(n_clusters=11, init=init, random_state=run_id,
                         n_init=n_init, **params).fit(kmeans_cus_X)
            inertia[i, run_id] = km.inertia_
    p = plt.errorbar(n_init_range, inertia.mean(axis=1), inertia.std(axis=1))
    plots.append(p[0])
    legends.append("%s with %s init" % (factory.__name__, init))

plt.xlabel('n_init')
plt.ylabel('inertia')
plt.legend(plots, legends)
plt.title("Customers: Mean inertia for various k-means init across %d runs" % n_runs)

While the inertia score drops quickly by 5 initializations, it continues to decline slowly out to 20. Though not evident here, we also continued out to 50 initializations and found that the inertia was still decreasing. Ultimately we settled on around 10 initializations for the customer set and 15 for the subscriber set.

K-means is a demanding clustering algorithm in that it requires a guided approach: first in knowing how many clusters to request, and ultimately in having a good idea of what the dataset should look like once it has been prepared for modeling.

Our dataset is a complex and primarily qualitative set with over 100 attributes, of which only 22 were identified via PCA in Lab 2 and used for clustering analysis. Moreover, we are attempting classification into one of two categories, and while our approach of splitting the dataset into two groups and examining how confidently new observations fit within the observed clusters is novel, it does not mesh well with some of the assumptions behind k-means. Admittedly, k-means did not require further random sampling of our already sampled dataset; it was, however, difficult to evaluate because of our data's lack of uniformity.

DBSCAN Clustering

As alluded to previously, DBSCAN is poorly suited to our application in terms of efficiency. We wanted very much to like it but could not compel ourselves to do so. Since the number of clusters is determined by the density parameters eps and minimum core samples, we endeavored simply to generate a reasonable number of customer and subscriber clusters for starters. Sadly, this became our primary metric for DBSCAN evaluation in the end, due to ridiculous computation times given our workstation limitations, which in turn stressed our project deliverables. Since we also knew the "curse of dimensionality" was working against valid DBSCAN results on our 18-attribute data sets (remember, coordinates were removed), we deemed it a better use of our time to move on to Spectral Clustering.
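To make the eps/minimum-core-samples mechanics concrete, here is a small sketch on synthetic two-blob data (not our CitiBike sample) showing how the cluster count and outlier count fall out of a DBSCAN fit; the counting matches the `labels_.max() + 1` convention used in the tables below:

```python
# Illustrative DBSCAN fit on synthetic data; eps and min_samples play the
# roles of the density parameters tuned across our db_c*/db_s* permutations.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 0.2, (50, 2)),   # dense blob near (0, 0)
               rng.normal(3, 0.2, (50, 2))])  # dense blob near (3, 3)

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
n_clusters = db.labels_.max() + 1        # noise is labeled -1, so max+1 works
n_outliers = (db.labels_ == -1).sum()
print(n_clusters, n_outliers)
```

Shrinking eps or raising min_samples fragments the blobs into more, smaller clusters plus noise, which is exactly the behavior we wrestled with above.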

Customer DBSCAN
In [86]:
cluster_id_customer = pd.DataFrame({"Cluster_ID_Customer": db_c7.labels_, "Cluster_ID_Subscriber": np.NaN})
In [87]:
Cust_clus_DB = pd.concat([CitiBike_C.reset_index(), cluster_id_customer], axis=1)
#CitiBike_C = pd.concat([CitiBike_C, cluster_id_customer], axis=1)

Portrayed in the table below are the cluster counts for each permutation of DBSCAN on our customer data. Our directed approach to eps and core sample changes moved in the right direction, with our first iteration producing a whopping 103 clusters and the last producing 49 (at significant expense in computation time). Based on the number of clusters produced and our prior knowledge of customer ride behaviors, 49 clusters is still unreasonably large. Given the $n^2$ computations, the time required to complete them, and the observable increase in time as fewer clusters were produced, we chose to stop here.

In [88]:
clus_counts = [db_c.labels_.max() + 1,
               db_c1.labels_.max() + 1,
               db_c2.labels_.max() + 1,
               db_c3.labels_.max() + 1,
               db_c4.labels_.max() + 1,
               db_c5.labels_.max() + 1,
               db_c6.labels_.max() + 1,
               db_c7.labels_.max() + 1]

clus_counts = pd.DataFrame({'Cluster_Counts' : clus_counts})
clus_counts.index.names = ['Permutation']
clus_counts.T
Out[88]:
Permutation 0 1 2 3 4 5 6 7
Cluster_Counts 103 84 69 57 52 51 59 49

Cluster size distributions for the final customer permutation are depicted in the table and histogram below. It is interesting that even with so many clusters, there is still significant imbalance in cluster size from one grouping to the next. Given our expectations for this data set, this further proves DBSCAN insufficient for our needs. On a positive note, there were only 2,570 outliers in our customer data (the average cluster held 5,000 transactions when including outliers).

In [89]:
Agg = pd.DataFrame({'count' : Cust_clus_DB.groupby(["Cluster_ID_Customer"]).size()}).reset_index()

display(Agg.T)
print("Min. transactions per cluster (including outliers) =", Agg['count'].min())
print("Max. transactions per cluster (including outliers) =", Agg['count'].max())
print("Average transactions per cluster (including outliers) =", Agg['count'].mean())
Agg.hist('count', color = 'orange', bins = 20)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
Cluster_ID_Customer -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
count 2570 16091 7435 15147 410 16636 5335 18407 7759 2407 17121 5839 7257 2796 9239 4325 2352 4304 8356 5878 2429 5937 5727 2210 418 7897 8197 15814 6426 2145 5593 2835 7569 787 1922 2637 661 888 422 269 2090 247 276 931 330 2311 2230 237 689 212
Min. transactions per cluster (including outliers) = 212
Max. transactions per cluster (including outliers) = 18407
Average transactions per cluster (including outliers) = 5000.0
Out[89]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000001E63B3C6048>]], dtype=object)
Subscriber DBSCAN
In [90]:
cluster_id_subscriber = pd.DataFrame({"Cluster_ID_Customer": np.NaN, "Cluster_ID_Subscriber": db_s8.labels_})
In [91]:
Sub_clus_DB = pd.concat([CitiBike_S.reset_index(), cluster_id_subscriber], axis=1)
#CitiBike_C = pd.concat([CitiBike_C, cluster_id_customer], axis=1)
#Sub_clus_DB.tail(10)

Portrayed in the table below are the cluster counts for each permutation of DBSCAN on our subscriber data. This time our directed approach to eps and core sample changes did not begin well: our first permutation produced 71 clusters, and subsequent permutations increased the count. Though our eps and minimum core samples parameters were eventually corrected, we implemented a couple of additional permutations in an attempt to drive the number of clusters down further (our smallest count being 36). Based on the number of clusters produced and our prior knowledge of subscriber ride behaviors, even 36 clusters is unreasonably large. Given the $n^2$ computations, the time required to complete them, and the observable increase in time as fewer clusters were produced, we chose to stop here.

In [92]:
clus_counts = [db_s.labels_.max() + 1,
               db_s1.labels_.max() + 1,
               db_s2.labels_.max() + 1,
               db_s3.labels_.max() + 1,
               db_s4.labels_.max() + 1,
               db_s5.labels_.max() + 1,
               db_s6.labels_.max() + 1,
               db_s7.labels_.max() + 1,
               db_s8.labels_.max() + 1,
               db_s9.labels_.max() + 1]

clus_counts = pd.DataFrame({'Cluster_Counts' : clus_counts})
clus_counts.index.names = ['Permutation']
clus_counts.T
Out[92]:
Permutation 0 1 2 3 4 5 6 7 8 9
Cluster_Counts 71 85 119 158 97 82 64 48 36 41

Cluster size distributions for the permutation producing 36 clusters are depicted in the table and histogram below. It is interesting that even with so many clusters, there is still significant imbalance in cluster size from one grouping to the next. The disparity between sizes is quite large for the subscriber DBSCAN results as well: for example, the 36th cluster contains only 465 transactions, whereas the 5th cluster contains 13,392, and there are 40,669 outliers in total! While this is the "best" subscriber configuration, it still falls short of our needs.

In [93]:
Agg = pd.DataFrame({'count' : Sub_clus_DB.groupby(["Cluster_ID_Subscriber"]).size()}).reset_index()

display(Agg.T)
print("Min. transactions per cluster (including outliers) =", Agg['count'].min())
print("Max. transactions per cluster (including outliers) =", Agg['count'].max())
print("Average transactions per cluster (including outliers) =", Agg['count'].mean())
Agg.hist('count', color = 'violet', bins = 20)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
Cluster_ID_Subscriber -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
count 40669 11893 8008 6031 13340 13392 6778 6372 5625 11643 5296 12609 8738 4822 9104 7787 5329 4943 1991 6616 4966 7566 1571 1750 7017 4798 8347 6366 4652 5118 2094 1244 1069 771 719 501 465
Min. transactions per cluster (including outliers) = 465
Max. transactions per cluster (including outliers) = 40669
Average transactions per cluster (including outliers) = 6756.75675676
Out[93]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000001E63A8591D0>]], dtype=object)

Spectral Clustering

Since we are trying to identify underlying traits exhibited within customer and subscriber classes across 22 attributes, we went into our spectral clustering methodology with high hopes. Fortunately, our results did not disappoint. By producing the Laplacian matrix and its associated eigenvectors, we were able to cluster data within a spectral space rather than the feature space.

Unlike DBSCAN and more like K-means, we were required to identify upfront an appropriate number of clusters to have the algorithm produce. Unlike K-means, however, this was done by analyzing the eigenvalue spectrum for customer and subscriber data sets and locating the point at which the largest gap occurs between eigenvalues for each cluster count. This will be described in further detail in the respective customer/subscriber sections that follow.

Customer Spectral Clustering

Before implementing our Spectral Clustering permutations in the Modeling and Evaluation Part 1 - Train and adjust parameters section above, we calculated Similarity, Affinity, Degree, and Laplacian matrices manually. We also calculated Laplacian matrix eigenvalues at that time. This preparation was all meant to lead up to the eigenvalue spectrum plot displayed below.

In this plot, cluster counts are displayed on the x-axis with Laplacian eigenvalues on the y-axis. The dotted red line indicates the largest gap between consecutive eigenvalues; the cluster count just below the gap is generally the one selected. We therefore identified 3 clusters as the optimum count for our customer data, and 3 is what we entered into our Spectral Clustering permutations earlier. More information regarding this approach is available in Ulrike von Luxburg's paper, A Tutorial on Spectral Clustering.
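The eigengap heuristic itself is simple enough to sketch. The helper below is illustrative only: it is not the lab's code, it uses the unnormalized graph Laplacian for simplicity, and it runs on a toy block-diagonal affinity matrix rather than our CitiBike similarity matrix:

```python
# Illustrative eigengap heuristic: pick the cluster count just below the
# largest gap between consecutive Laplacian eigenvalues.
import numpy as np

def eigengap_k(affinity, max_k=10):
    """Suggest a cluster count from the unnormalized graph Laplacian L = D - A."""
    degree = np.diag(affinity.sum(axis=1))
    laplacian = degree - affinity
    vals = np.linalg.eigvalsh(laplacian)[:max_k]  # ascending eigenvalues
    gaps = np.diff(vals)                          # gap between consecutive values
    return int(np.argmax(gaps)) + 1               # k just below the largest gap

# Block-diagonal affinity with 3 obvious groups of 4 points each.
A = np.kron(np.eye(3), np.ones((4, 4)))
print(eigengap_k(A))  # -> 3
```

On a clean block-diagonal affinity the Laplacian has exactly one zero eigenvalue per connected component, so the heuristic recovers the block count directly.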

In [94]:
%%time
%%R
plot(1:10, rev(evLc$values)[1:10], xlab = "Customer Clusters", ylab = "Laplacian Eigenvalues", main = "Customer Eigenvalue Spectrum")
abline(v=3.5, col="red", lty=2) # largest eigenvalue gap between 3 and 4 clusters
Wall time: 375 ms

Before jumping straight to our permutations, however, we took interest in comparing within sum of squares while varying cluster counts. Below is a plot depicting the within sum of squares for each number of clusters. Note that the only parameter changed in our function call is the number of clusters.

As can be seen, while 3 clusters does not produce the smallest within sum of squares, its Wss is the third smallest. This seems a good compromise given the behavior of the eigenvalues above.
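A rough Python analogue of this kind of Wss sweep, using k-means inertia as a stand-in for specc's withinss and synthetic data in place of mCitiC (assumptions, not the lab's code):

```python
# Sketch of a within-sum-of-squares sweep over cluster counts. For k = 1 the
# Wss equals the total sum of squared deviations from the column means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(100)
X = rng.rand(300, 5)  # synthetic stand-in for the 1000-row customer matrix

wss = [((len(X) - 1) * X.var(axis=0, ddof=1)).sum()]  # k = 1: total SS
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=100).fit(X)
    wss.append(km.inertia_)  # within-cluster sum of squares for k clusters
print([round(w, 1) for w in wss])
```

The k = 1 term mirrors the R expression `(nrow - 1) * sum(column variances)` used in the cell below, since the sample variance times (n - 1) recovers each column's sum of squared deviations.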

In [95]:
%%time
%%R
set.seed(100)
wss <- 0
wss[1] <- (nrow(mCitiC)-1)*sum(apply(mCitiC,2,var)) # total SS for k = 1
for (i in 2:22) wss[i] <- sum(withinss(specc(mCitiC, centers = i)))
    
plot(1:22, wss, type="b", xlab="Number of Customer Clusters", ylab="Within groups sum of squares", main = "Customer WSS", col = 'blue')
Wall time: 9min 34s
In [96]:
%%time
%%R
wss <- c(sum(withinss(sc_c)),
         sum(withinss(sc_c1)),
         sum(withinss(sc_c2)),
         sum(withinss(sc_c3)),
         sum(withinss(sc_c4)),
         sum(withinss(sc_c5)))

siz <- c(paste(size(sc_c), sep = '', collapse = ', '),
         paste(size(sc_c1), sep = '', collapse = ', '),
         paste(size(sc_c2), sep = '', collapse = ', '),
         paste(size(sc_c3), sep = '', collapse = ', '),
         paste(size(sc_c4), sep = '', collapse = ', '),
         paste(size(sc_c5), sep = '', collapse = ', '))

its <- c(200,500,1000,1000,1000,1000)
cen <- c(3,3,3,3,3,3)
ker <- c("rbfdot", "rbfdot", "rbfdot", "laplacedot", "polydot", "rbfdot")
nys <- c(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE)

Cust_SC <- data.frame(cen, its, ker, nys, wss, siz)
Cust_SC <- setNames(Cust_SC, c("Cluster.Count", "Iterations", "Kernel", "Nystrom.Method", "wss", "Sizes"))
Wall time: 109 ms

The theme of within sum of squares continues as we review the outcome of each configuration tried. Surprisingly, there was no difference in Wss or cluster size distributions, as displayed in the table below (ordered from smallest to largest Wss). The only deviation from a Wss of 24,927 came from the Nystrom Method configuration (Wss = 22,842). Again, the Nystrom Method further samples the data when calculating the Laplacian eigenvectors. Interestingly, this approach not only produced the smallest Wss but also the most evenly distributed cluster sizes.

In [97]:
%%time
%R Cust_SC[order(wss),]
Wall time: 193 ms
Out[97]:
Cluster.Count Iterations Kernel Nystrom.Method wss Sizes
6 3.0 1000.0 rbfdot 1 22841.930962 261, 443, 296
1 3.0 200.0 rbfdot 0 24927.297101 40, 465, 495
2 3.0 500.0 rbfdot 0 24927.297101 40, 465, 495
3 3.0 1000.0 rbfdot 0 24927.297101 40, 465, 495
4 3.0 1000.0 laplacedot 0 24927.297101 40, 465, 495
5 3.0 1000.0 polydot 0 24927.297101 40, 465, 495

Within sum of squares for each permutation is again depicted below in line chart form. Notice all Wss values are the same except that of the sixth permutation, which utilizes the Nystrom Method for eigenvector calculation.

In [98]:
Cust_SC = %R Cust_SC
fig = plt.figure()
plt.plot(Cust_SC.index.values, Cust_SC.wss, 'ro-', linewidth=2)
plt.title('Total Within Sum of Squares')
plt.xlabel('Configuration')
plt.ylabel('Total Wss')
plt.annotate('Selected Configuration', xy=(6, 22841), xytext=(4.2, 22550),
            arrowprops=dict(facecolor='blue', shrink=0.05),)

plt.show()

Another note about the Nystrom Method is its speed. While running our algorithm without Nystrom's Method enabled, our runtimes averaged about 30 seconds. Our algorithm with Nystrom's Method enabled averaged between only 400ms and 600ms. This is approximately a 98.3% reduction in runtime while still producing the best results! So 3 clusters, 1000 iterations, the Radial Basis kernel, and Nystrom Method enabled takes the clear win among customer Spectral Clustering permutations.
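The speedup comes from never forming the full kernel matrix. A minimal sketch of the Nystrom idea on synthetic data, with an illustrative RBF helper rather than kernlab's implementation, looks like this:

```python
# Nystrom approximation sketch: rebuild an n x n RBF kernel matrix from an
# n x m cross-kernel against m sampled "landmark" rows.
import numpy as np

def rbf(X, Y, gamma=1.0):
    """Gaussian (RBF) kernel between the rows of X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.RandomState(0)
X = rng.rand(120, 4)
m = 20                                      # landmark count (specc uses nrow/6)
idx = rng.choice(len(X), m, replace=False)

C = rbf(X, X[idx])                          # n x m cross-kernel (all we store)
W = C[idx]                                  # m x m kernel among landmarks
K_approx = C @ np.linalg.pinv(W) @ C.T      # n x n approximation
K_full = rbf(X, X)
print(np.abs(K_full - K_approx).max())      # approximation error
```

Only the n x m slice and the small m x m block are ever computed, which is why the memory and runtime savings grow so quickly with n.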

Subscriber Spectral Clustering

Similar to customer spectral clustering, we had calculated Similarity, Affinity, Degree, and Laplacian matrices manually in the Modeling and Evaluation Part 1 - Train and adjust parameters section along with the Laplacian matrix eigenvalues. This, too, led up to the eigenvalue spectrum plot displayed as follows.

Subscriber Laplacian eigenvalues differed a bit from the customer eigenvalues in that substantial gaps exist both between 4 and 5 clusters and between 5 and 6 clusters. This means that either the first 4 or the first 5 eigenvectors contain all the cluster information needed to describe the subscriber data. For this reason we tried both counts in our subscriber permutations.

In [99]:
%%time
%%R
plot(1:10, rev(evLs$values)[1:10], xlab = "Subscriber Clusters", ylab = "Laplacian Eigenvalues", log = "y", main = "Subscriber Eigenvalue Spectrum")
abline(v=4.5, col="red", lty=2) # largest eigenvalue gap between 4 and 5 clusters
abline(v=5.5, col="blue", lty=3) # Potential gap between 5 and 6 clusters too
Wall time: 18.5 ms

Again, we wanted to review within sum of squares while varying cluster counts before trying our different parameters. Below is a plot depicting the within sum of squares for each number of clusters. Note that the only parameter changed in our function call is the number of clusters.

As can be seen, both 4 and 5 clusters exhibit relatively small Wss values, though still larger than those of 2 or 3 clusters. It is also interesting to observe a dip in Wss between 3 and 4 clusters before it increases again at 5. While this relationship between Wss and cluster count is interesting to review, our real strategy with Spectral Clustering is to select the number of clusters based on the Laplacian eigenvalues.

In [100]:
%%time
%%R
set.seed(100)
wss <- 0
wss[1] <- (nrow(mCitiS)-1)*sum(apply(mCitiS,2,var)) # total SS for k = 1
for (i in 2:22) wss[i] <- sum(withinss(specc(mCitiS, centers = i)))
    
plot(1:22, wss, type="b", xlab="Number of Subscriber Clusters", ylab="Within groups sum of squares", main = "Subscriber WSS", col = 'red')
Wall time: 9min 48s
In [101]:
%%time
%%R
wss <- c(sum(withinss(sc_s)),
         sum(withinss(sc_s6)),
         sum(withinss(sc_s1)),
         sum(withinss(sc_s2)),
         sum(withinss(sc_s3)),
         sum(withinss(sc_s4)),
         sum(withinss(sc_s5)))

siz <- c(paste(size(sc_s), sep = '', collapse = ', '),
         paste(size(sc_s6), sep = '', collapse = ', '),
         paste(size(sc_s1), sep = '', collapse = ', '),
         paste(size(sc_s2), sep = '', collapse = ', '),
         paste(size(sc_s3), sep = '', collapse = ', '),
         paste(size(sc_s4), sep = '', collapse = ', '),
         paste(size(sc_s5), sep = '', collapse = ', '))

its <- c(200,200,500,1000,1000,1000,1000)
cen <- c(4,5,5,5,5,5,5)
ker <- c("rbfdot", "rbfdot", "rbfdot", "rbfdot", "laplacedot", "polydot", "rbfdot")
nys <- c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE)

Sub_SC <- data.frame(cen, its, ker, nys, wss, siz)
Sub_SC <- setNames(Sub_SC, c("Cluster.Count", "Iterations", "Kernel", "Nystrom.Method", "wss", "Sizes"))
Wall time: 12.5 ms

Again, with the exception of the Nystrom Method configuration, there was no difference in Wss or cluster size distributions for 5 clusters, as displayed in the table below. There was, however, an expected difference between the 4- and 5-cluster Wss values. What is unexpected is that the 4-cluster Wss is larger than the 5-cluster value; this may be due to the randomized sampling the R specc function employs. Regardless, the Nystrom Method still outperforms all other configurations in both Wss and cluster size distribution.

In [102]:
%%time
%R Sub_SC[order(wss),]
Wall time: 10.6 ms
Out[102]:
Cluster.Count Iterations Kernel Nystrom.Method wss Sizes
7 5.0 1000.0 rbfdot 1 26926.132676 199, 79, 290, 293, 139
2 5.0 200.0 rbfdot 0 28262.798889 975, 10, 11, 2, 2
3 5.0 500.0 rbfdot 0 28262.798889 975, 10, 11, 2, 2
4 5.0 1000.0 rbfdot 0 28262.798889 975, 10, 11, 2, 2
5 5.0 1000.0 laplacedot 0 28262.798889 975, 10, 11, 2, 2
6 5.0 1000.0 polydot 0 28262.798889 975, 10, 11, 2, 2
1 4.0 200.0 rbfdot 0 30523.417367 4, 140, 6, 850

Within sum of squares for each permutation is again depicted below in line chart form. The Nystrom Method clearly takes the win for smallest Wss.

In [103]:
Sub_SC = %R Sub_SC
fig = plt.figure()
plt.plot(Sub_SC.index.values, Sub_SC.wss, 'ro-', linewidth=2)
plt.title('Total Within Sum of Squares')
plt.xlabel('Configuration')
plt.ylabel('Total Wss')
plt.annotate('Selected Configuration', xy=(7, 26926), xytext=(5.2, 26550),
            arrowprops=dict(facecolor='blue', shrink=0.05),)

plt.show()

Again worth noting is the Nystrom Method's speed. While running our algorithm without Nystrom's Method enabled, our runtimes averaged about 30 seconds. Our algorithm with Nystrom's Method enabled averaged between only 400ms and 600ms. This is approximately a 98.3% reduction in runtime while still producing the best results! So 5 clusters, 1000 iterations, the Radial Basis kernel, and Nystrom Method enabled takes the clear win among subscriber Spectral Clustering permutations.

The Final Winner

Based on the evaluations above, we have identified Spectral Clustering with the Radial Basis kernel, 1000 iterations, and the Nystrom Method enabled as the clear winner: 3 clusters for the customer data and 5 clusters for the subscriber data. Each configuration produces the smallest within sum of squares in the shortest amount of time. While our other clustering methods suffer from the "curse of dimensionality," the effects of high dimensionality in the feature space are significantly mitigated in the spectral space. All this is accomplished while producing a small, manageable number of evenly sized clusters.

Below, we merge the new cluster IDs with the scaled CitiBike customer and subscriber data and pickle the dataframe for further use as described in our Deployment section.

As the final Spectral Clustering permutation is the only one we care about in the end, it will be our primary focus as we visualize cluster results in the Modeling and Evaluation Part 3 - Visualize Results section.

In [104]:
%R Cust_clus <- cbind(CitiBike_miniC, Cluster_ID_Customer = sc_c5, Cluster_ID_Subscriber = NaN)
clear_display()
In [105]:
Cust_clus = %R Cust_clus

if os.path.isfile("PickleFiles/Cust_clus.pkl"):
    print("File already created")
else: pickleObject(Cust_clus, "Cust_clus", filepath = "PickleFiles/")
File already created
In [106]:
%R Sub_clus <- cbind(CitiBike_miniS, Cluster_ID_Customer = NaN, Cluster_ID_Subscriber = sc_s5)
clear_display()
In [107]:
Sub_clus = %R Sub_clus

if os.path.isfile("PickleFiles/Sub_clus.pkl"):
    print("File already created")
else: pickleObject(Sub_clus, "Sub_clus", filepath = "PickleFiles/")
File already created

Modeling and Evaluation Part 3 - Visualize Results

With spectral clustering selected as our winner, we wanted to visualize the identified clusters across a few key attribute relationships. Originally we plotted all clusters at once, with different colors representing the clusters on the plot. We quickly found this infeasible to interpret because of how many data points overlapped between clusters (example below).

The overlapping data points and resulting noise are likely due to the high-dimensional inputs used for our cluster model fit. The limitations of 3D visualization simply do not let us see the clusters clearly, because the graph cannot show the entire picture.

Although we encountered these challenges, we decided to focus on 3D scatterplots of the relationship TripDurationLog has with location and weather data with respect to cluster assignments, for both the customer-fit and subscriber-fit clusters. Given the noisy plots produced, we subplot these visualizations by cluster ID, which allows us to compare the clusters side by side instead of overlaid on top of each other. For each 3D plot produced, we analyze the results from three different views by rotating the x, y, and z axes; this helps us see the data from different angles and elevations, potentially revealing additional insights.

In [108]:
Sub_clus = unpickleObject("Sub_clus")
Cust_clus = unpickleObject("Cust_clus")
In [109]:
def scatter3d(ax,title,x,y,z,cs,xLab,yLab,zLab,xLim,yLim,zLim,rot,el):
    #fig = plt.figure()
    #ax = fig.gca(projection='3d')
    ax.scatter(x, y, z, c=cs, cmap = 'prism', zdir = 'z')
    ax.set_xlabel(xLab)
    ax.set_ylabel(yLab)
    ax.set_zlabel(zLab)
    ax.set_title(title)
    ax.set_xlim(xLim)
    ax.set_ylim(yLim)
    ax.set_zlim(zLim)
    ax.view_init(azim=rot, elev = el) # Set rotation angle
Customers

Location by Trip Duration (Log) - Coordinate View

With X and Y axes as location coordinates, we were able to look at the data in a "Top-down" approach with high elevation in order to see if clusters appeared to target specific locations in NYC. We identified that clusters do not vary much in the northwest regions of the city, but we do see some slight variation in clusters in the southeast region. Cluster 1 appears to have data throughout the whole city, Cluster 2 has very sparse data in the southeast region and very dense data in the rest of the city, and finally Cluster 3 contains nearly no data at all in the southeast region of the city.

In [110]:
custdistclusters = Cust_clus.Cluster_ID_Customer.drop_duplicates().sort_values().values
In [111]:
plt.rcParams['figure.figsize'] = (18, 9)
fig = plt.figure()

for i in range(0,len(custdistclusters)):

    x  = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].start_station_longitude
    y  = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].start_station_latitude
    z  = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].tripdurationLog
    cs = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].Cluster_ID_Customer

    xLim = (Cust_clus.start_station_longitude.min(), Cust_clus.start_station_longitude.max())
    yLim = (Cust_clus.start_station_latitude.min(),  Cust_clus.start_station_latitude.max())
    zLim = (Cust_clus.tripdurationLog.min(),         Cust_clus.tripdurationLog.max())
    
    xLab = 'Longitude (scaled)'
    yLab = 'Latitude (scaled)'
    zLab = 'Trip Duration Log  (scaled)'

    rot = -90
    el = 80
    
    title = "Customer Cluster {0}".format(i+1)
    ax = fig.add_subplot(2,3,i + 1, projection='3d')
    scatter3d(ax,title,x,y,z,cs,xLab,yLab,zLab,xLim,yLim,zLim,rot,el)

plt.show()

Location by Trip Duration (Log) - Longitude View

Plotting the data at a 90-degree angle and very slight elevation lets us assess trip durations as start location coordinates move east to west across the map. We see fairly consistent variance in trip duration (log) from east to west within each cluster, which may suggest that location within these clusters does not play a large role in trip duration. However, looking at the variance of plotted points across clusters, there is a clear difference: Cluster 1 contains trip duration (log) values ranging +/-4, Cluster 2 +/-3, and Cluster 3 +/-2. Interestingly, these differences in variance do not prevent the value ranges from overlapping, which suggests that other factors not plotted here help differentiate the clusters. We also, once again, see the eastern coordinate data values disappear in the third cluster.

In [112]:
plt.rcParams['figure.figsize'] = (18, 9)
fig = plt.figure()

for i in range(0,len(custdistclusters)):
    
    x  = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].start_station_longitude
    y  = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].start_station_latitude
    z  = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].tripdurationLog
    cs = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].Cluster_ID_Customer

    xLim = (Cust_clus.start_station_longitude.min(), Cust_clus.start_station_longitude.max())
    yLim = (Cust_clus.start_station_latitude.min(),  Cust_clus.start_station_latitude.max())
    zLim = (Cust_clus.tripdurationLog.min(),         Cust_clus.tripdurationLog.max())
    
    xLab = 'Longitude (scaled)'
    yLab = 'Latitude (scaled)'
    zLab = 'Trip Duration Log  (scaled)'

    rot = -90
    el = 10

    title = "Customer Cluster {0}".format(i+1)
    ax = fig.add_subplot(2,3,i + 1, projection='3d')
    scatter3d(ax,title,x,y,z,cs,xLab,yLab,zLab,xLim,yLim,zLim,rot,el)

plt.show()

Location by Trip Duration (Log) - 3D View

This view, rotated at -60 degrees with an elevation of 45, was definitely the most difficult to interpret. We once again see the consistent within-cluster variance observed previously, and the decreasing variance as you move from Cluster 1 to Cluster 3. Further insights from this visualization are difficult to ascertain, given the dimensions used in the model fit but absent from this view.

In [113]:
plt.rcParams['figure.figsize'] = (18, 9)
fig = plt.figure()

for i in range(0,len(custdistclusters)):

    x  = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].start_station_longitude
    y  = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].start_station_latitude
    z  = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].tripdurationLog
    cs = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].Cluster_ID_Customer

    xLim = (Cust_clus.start_station_longitude.min(), Cust_clus.start_station_longitude.max())
    yLim = (Cust_clus.start_station_latitude.min(),  Cust_clus.start_station_latitude.max())
    zLim = (Cust_clus.tripdurationLog.min(),         Cust_clus.tripdurationLog.max())
    
    xLab = 'Longitude (scaled)'
    yLab = 'Latitude (scaled)'
    zLab = 'Trip Duration Log  (scaled)'

    rot = -60
    el = 45

    title = "Customer Cluster {0}".format(i+1)
    ax = fig.add_subplot(2,3,i + 1, projection='3d')
    scatter3d(ax,title,x,y,z,cs,xLab,yLab,zLab,xLim,yLim,zLim,rot,el)

plt.show()

Weather by Trip Duration (Log) - Weather View

We have chosen to assess the weather attributes average temperature and precipitation against trip duration, to identify interesting trends across clusters. In our first view, we plot from a high elevation to focus on where data are present across clusters under varying weather conditions. Because the majority of rides are taken on non-rainy days, all three clusters contain dense regions around zero precipitation and warmer temperatures. The interesting takeaway is the different weather "tolerance" levels among customer riders: moving from Cluster 1 to Cluster 3, tolerance for bad weather (both cold temperatures and high precipitation) decreases. This may suggest that customers in Cluster 1 are more dedicated riders. One could speculate that the higher a customer's tolerance for bad weather, the more likely they may be to become a subscriber. The relationship of weather attributes across clusters may therefore be an important factor in identifying our target user segment.

In [114]:
plt.rcParams['figure.figsize'] = (18, 9)
fig = plt.figure()

for i in range(0,len(custdistclusters)):

    x  = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].PRCP
    y  = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].TAVE
    z  = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].tripdurationLog
    cs = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].Cluster_ID_Customer

    xLim = (Cust_clus.PRCP.min(),            Cust_clus.PRCP.max())
    yLim = (Cust_clus.TAVE.min(),            Cust_clus.TAVE.max())
    zLim = (Cust_clus.tripdurationLog.min(), Cust_clus.tripdurationLog.max())
    
    xLab = 'Precipitation (scaled)'
    yLab = 'Avg Temp (scaled)'
    zLab = 'Trip Duration Log (scaled)'

    rot = -90
    el = 80

    title = "Customer Cluster {0}".format(i+1)
    ax = fig.add_subplot(2,3,i + 1, projection='3d')
    scatter3d(ax,title,x,y,z,cs,xLab,yLab,zLab,xLim,yLim,zLim,rot,el)

plt.show()

Weather by Trip Duration (Log) - Rain View

Once again, we plot the data at a 90 degree angle and a very slight elevation. As discussed in the previous view, the same trend occurs, with precipitation tolerance decreasing across clusters. We also see another interesting difference: the range of trip duration values is much wider in clusters with greater weather tolerance. This supports the earlier theory that Cluster 1 observations may be good target candidates.

In [115]:
plt.rcParams['figure.figsize'] = (18, 9)
fig = plt.figure()

for i in range(0,len(custdistclusters)):

    x  = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].PRCP
    y  = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].TAVE
    z  = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].tripdurationLog
    cs = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].Cluster_ID_Customer

    xLim = (Cust_clus.PRCP.min(),            Cust_clus.PRCP.max())
    yLim = (Cust_clus.TAVE.min(),            Cust_clus.TAVE.max())
    zLim = (Cust_clus.tripdurationLog.min(), Cust_clus.tripdurationLog.max())

    xLab = 'Precipitation (scaled)'
    yLab = 'Avg Temp (scaled)'
    zLab = 'Trip Duration Log (scaled)'

    rot = -90
    el = 10

    title = "Customer Cluster {0}".format(i+1)
    ax = fig.add_subplot(2,3,i + 1, projection='3d')
    scatter3d(ax,title,x,y,z,cs,xLab,yLab,zLab,xLim,yLim,zLim,rot,el)

plt.show()

Weather by Trip Duration (Log) - 3D View

Finally, our 3D rendered view, depicting all three attributes, provides further backing for our previous statements. The density around "optimal weather" conditions is much tighter in Cluster 3, in terms of both precipitation and temperature, than in Cluster 2 or Cluster 1. The best way to describe these attributes across clusters is simply as levels of tolerance to weather conditions, with Cluster 1 being the most tolerant (will ride no matter what) and Cluster 3 the least (will ride only in optimal conditions).

In [116]:
plt.rcParams['figure.figsize'] = (18, 9)
fig = plt.figure()

for i in range(0,len(custdistclusters)):

    x  = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].PRCP
    y  = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].TAVE
    z  = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].tripdurationLog
    cs = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].Cluster_ID_Customer

    xLim = (Cust_clus.PRCP.min(),            Cust_clus.PRCP.max())
    yLim = (Cust_clus.TAVE.min(),            Cust_clus.TAVE.max())
    zLim = (Cust_clus.tripdurationLog.min(), Cust_clus.tripdurationLog.max())
    
    xLab = 'Precipitation (scaled)'
    yLab = 'Avg Temp (scaled)'
    zLab = 'Trip Duration Log (scaled)'

    rot = -60
    el = 45

    title = "Customer Cluster {0}".format(i+1)
    ax = fig.add_subplot(2,3,i + 1, projection='3d')
    scatter3d(ax,title,x,y,z,cs,xLab,yLab,zLab,xLim,yLim,zLim,rot,el)

plt.show()
Subscribers

Location by Trip Duration (Log) - Coordinate View

With the X and Y axes as location coordinates, we take a top-down view at high elevation to see whether clusters appear to target specific locations in NYC. With five clusters, we expected much more differentiation by location than in the customer clusters. There is a slightly noticeable difference in the locations within Cluster 2 compared to the others, but the difference is subtle and potentially caused by the difference in sample size between clusters. All other clusters resemble one another, and no firm conclusions may be drawn as to which locations correspond to which clusters.

In [117]:
subdistclusters = Sub_clus.Cluster_ID_Subscriber.drop_duplicates().sort_values().values
In [118]:
plt.rcParams['figure.figsize'] = (18, 9)
fig = plt.figure()

for i in range(0,len(subdistclusters)):

    x  = Sub_clus[Sub_clus["Cluster_ID_Subscriber"] == i + 1].start_station_longitude
    y  = Sub_clus[Sub_clus["Cluster_ID_Subscriber"] == i + 1].start_station_latitude
    z  = Sub_clus[Sub_clus["Cluster_ID_Subscriber"] == i + 1].tripdurationLog
    cs = Sub_clus[Sub_clus["Cluster_ID_Subscriber"] == i + 1].Cluster_ID_Subscriber

    xLim = (Sub_clus.start_station_longitude.min(), Sub_clus.start_station_longitude.max())
    yLim = (Sub_clus.start_station_latitude.min(),  Sub_clus.start_station_latitude.max())
    zLim = (Sub_clus.tripdurationLog.min(),         Sub_clus.tripdurationLog.max())
    
    xLab = 'Longitude (scaled)'
    yLab = 'Latitude (scaled)'
    zLab = 'Trip Duration Log  (scaled)'

    rot = -90
    el = 80
    
    title = "Subscriber Cluster {0}".format(i+1)
    ax = fig.add_subplot(2,3,i + 1, projection='3d')
    scatter3d(ax,title,x,y,z,cs,xLab,yLab,zLab,xLim,yLim,zLim,rot,el)

plt.show()

Location by Trip Duration (Log) - Longitude View

Plotting the data at a 90 degree angle and a very slight elevation lets us assess trip durations as start-location coordinates move east to west across the map. As with the customer clusters, we see fairly consistent variance in trip duration (log) from east to west within each cluster. One may speculate that these location coordinates are minimally impactful in our cluster fits, and potentially in identifying user types as well. Variance in trip duration changes minimally across clusters and appears relatively unfazed by the actual start location.

In [119]:
plt.rcParams['figure.figsize'] = (18, 9)
fig = plt.figure()

for i in range(0,len(subdistclusters)):
    
    x  = Sub_clus[Sub_clus["Cluster_ID_Subscriber"] == i + 1].start_station_longitude
    y  = Sub_clus[Sub_clus["Cluster_ID_Subscriber"] == i + 1].start_station_latitude
    z  = Sub_clus[Sub_clus["Cluster_ID_Subscriber"] == i + 1].tripdurationLog
    cs = Sub_clus[Sub_clus["Cluster_ID_Subscriber"] == i + 1].Cluster_ID_Subscriber

    xLim = (Sub_clus.start_station_longitude.min(), Sub_clus.start_station_longitude.max())
    yLim = (Sub_clus.start_station_latitude.min(),  Sub_clus.start_station_latitude.max())
    zLim = (Sub_clus.tripdurationLog.min(),         Sub_clus.tripdurationLog.max())
    
    xLab = 'Longitude (scaled)'
    yLab = 'Latitude (scaled)'
    zLab = 'Trip Duration Log  (scaled)'

    rot = -90
    el = 10

    title = "Subscriber Cluster {0}".format(i+1)
    ax = fig.add_subplot(2,3,i + 1, projection='3d')
    scatter3d(ax,title,x,y,z,cs,xLab,yLab,zLab,xLim,yLim,zLim,rot,el)

plt.show()

Location by Trip Duration (Log) - 3D View

This view, rotated at -60 degrees with an elevation of 45, was once again the most difficult to interpret. We see the consistent within-cluster variance observed previously, as well as the differing trip duration variance as you move across clusters. As with our customer visualizations by location, further insights are difficult to ascertain, given the dimensions used in the model fit but absent from this view.

In [120]:
plt.rcParams['figure.figsize'] = (18, 9)
fig = plt.figure()

for i in range(0,len(subdistclusters)):

    x  = Sub_clus[Sub_clus["Cluster_ID_Subscriber"] == i + 1].start_station_longitude
    y  = Sub_clus[Sub_clus["Cluster_ID_Subscriber"] == i + 1].start_station_latitude
    z  = Sub_clus[Sub_clus["Cluster_ID_Subscriber"] == i + 1].tripdurationLog
    cs = Sub_clus[Sub_clus["Cluster_ID_Subscriber"] == i + 1].Cluster_ID_Subscriber

    xLim = (Sub_clus.start_station_longitude.min(), Sub_clus.start_station_longitude.max())
    yLim = (Sub_clus.start_station_latitude.min(),  Sub_clus.start_station_latitude.max())
    zLim = (Sub_clus.tripdurationLog.min(),         Sub_clus.tripdurationLog.max())
    
    xLab = 'Longitude (scaled)'
    yLab = 'Latitude (scaled)'
    zLab = 'Trip Duration Log  (scaled)'

    rot = -60
    el = 45

    title = "Subscriber Cluster {0}".format(i+1)
    ax = fig.add_subplot(2,3,i + 1, projection='3d')
    scatter3d(ax,title,x,y,z,cs,xLab,yLab,zLab,xLim,yLim,zLim,rot,el)

plt.show()

Weather by Trip Duration (Log) - Weather View

Once again, we plot from a high elevation to focus on where data are present across clusters under varying weather conditions. Interestingly, there is very clear differentiation between clusters with regard to temperature. Clusters 1 and 2 capture subscribing members with a high tolerance for cold weather, whereas Clusters 3, 4, and 5 all focus on warmer weather. This hard break between clusters yields very different cluster features than those found in the customer clusters. Looking across the warm-weather clusters, there is also a difference in precipitation tolerance, with Cluster 4 having the greatest tolerance for rain and Cluster 5 the least.

In [121]:
plt.rcParams['figure.figsize'] = (18, 9)
fig = plt.figure()

for i in range(0,len(subdistclusters)):

    x  = Sub_clus[Sub_clus["Cluster_ID_Subscriber"] == i + 1].PRCP
    y  = Sub_clus[Sub_clus["Cluster_ID_Subscriber"] == i + 1].TAVE
    z  = Sub_clus[Sub_clus["Cluster_ID_Subscriber"] == i + 1].tripdurationLog
    cs = Sub_clus[Sub_clus["Cluster_ID_Subscriber"] == i + 1].Cluster_ID_Subscriber
    
    xLim = (Sub_clus.PRCP.min(),             Sub_clus.PRCP.max())
    yLim = (Sub_clus.TAVE.min(),             Sub_clus.TAVE.max())
    zLim = (Sub_clus.tripdurationLog.min(),  Sub_clus.tripdurationLog.max())
    
    xLab = 'Precipitation (scaled)'
    yLab = 'Avg Temp (scaled)'
    zLab = 'Trip Duration Log (scaled)'

    rot = -90
    el = 80

    title = "Subscriber Cluster {0}".format(i+1)
    ax = fig.add_subplot(2,3,i + 1, projection='3d')
    scatter3d(ax,title,x,y,z,cs,xLab,yLab,zLab,xLim,yLim,zLim,rot,el)

plt.show()

Weather by Trip Duration (Log) - 3D View

Although we have identified cluster trends by weather in our subscriber clusters, there does not appear to be much difference in trip duration across clusters. This may be due to the purpose for which subscribers use the service: rain or shine, they are likely making the same routine trips, which typically take the same amount of time. As with the customer cluster visualizations, we characterize these clusters as identifying different levels of tolerance for weather conditions. Given the clear break between clusters by temperature, and the differing levels of precipitation tolerance among warmer-weather rides, weather appears to be a strong component in identifying these clusters.

In [122]:
plt.rcParams['figure.figsize'] = (18, 9)
fig = plt.figure()

for i in range(0,len(subdistclusters)):

    x  = Sub_clus[Sub_clus["Cluster_ID_Subscriber"] == i + 1].PRCP
    y  = Sub_clus[Sub_clus["Cluster_ID_Subscriber"] == i + 1].TAVE
    z  = Sub_clus[Sub_clus["Cluster_ID_Subscriber"] == i + 1].tripdurationLog
    cs = Sub_clus[Sub_clus["Cluster_ID_Subscriber"] == i + 1].Cluster_ID_Subscriber

    xLim = (Sub_clus.PRCP.min(),             Sub_clus.PRCP.max())
    yLim = (Sub_clus.TAVE.min(),             Sub_clus.TAVE.max())
    zLim = (Sub_clus.tripdurationLog.min(),  Sub_clus.tripdurationLog.max())
    
    xLab = 'Precipitation (scaled)'
    yLab = 'Avg Temp (scaled)'
    zLab = 'Trip Duration Log (scaled)'

    rot = -60
    el =  45

    title = "Subscriber Cluster {0}".format(i+1)
    ax = fig.add_subplot(2,3,i + 1, projection='3d')
    scatter3d(ax,title,x,y,z,cs,xLab,yLab,zLab,xLim,yLim,zLim,rot,el)

plt.show()

Visualization Summary

Ultimately, given our high-dimensional cluster model inputs, it was very difficult to understand what makes each cluster unique. Many attributes are left untouched in these visualizations due to the limitations of our 3D plots. Within the attributes analyzed, we conclude that location does not play a large role in identifying clusters, with the exception of the southeast region of the city, and that one of the core components of the cluster segments is weather tolerance during a ride. Although this factor was deemed important, several other attributes were used to build these clusters, and the interactions among them could shed light on many other insights about the cluster segments produced.

Modeling and Evaluation Part 4 - Summarize the Ramifications

As noted several times by this point, when pitting K-means, DBSCAN, and Spectral Clustering against one another, Spectral Clustering takes the win for both our customer and subscriber data sets. Reviewing cluster counts alone, customer and subscriber counts were 7 and 11 for K-means, 49 and 36 for DBSCAN, and 3 and 5 for Spectral Clustering, respectively. DBSCAN was simply too inefficient and time consuming when including all clustering attributes (station coordinates excluded), especially as the number of clusters produced was decreased. When comparing K-means against Spectral Clustering, the clusters produced by Spectral Clustering exhibited the best cohesion. More specifically still, implementing our Spectral Clustering algorithms with Nystrom's Method for eigenvector computation produced the best clustering results of all configurations tested, from both a cohesion and a resource perspective.

But what are the ramifications of choosing this Spectral Clustering configuration? The biggest pitfall is that heavy sampling was required to make Spectral Clustering feasible from a resource perspective, which means most transactions remain unclustered. In addition, our decision to split the data set back into separate customer and subscriber sets works against our business need to identify would-be/should-be subscribers among customer riders. We recognize this deficiency but have devised a means of mitigating it via classification on the produced clusters, explored throughout the remainder of this section.

In addition to classifying clusters for the missing user type, we will also perform PCA, similar to Lab 2, using these newly generated groupings. Through this additional analysis, we hope to extend our description of the results and convey even more compelling findings than we have throughout Modeling and Evaluation Parts 1-3.

Classification of Clusters

With our clusters fit and appended to the original data, we still do not know which customer cluster our subscriber observations would fall into, and vice versa. To assign clusters for the inverse user type, we need to build a classification model that can predict the appropriate cluster with strong confidence. For these models, we primarily use confusion matrices, alongside accuracy percentages, to explore our results.
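As a quick illustration of this evaluation approach: a confusion matrix counts true labels along the rows and predicted labels along the columns, so the diagonal holds the correct assignments. The labels below are hypothetical, for illustration only:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical true and predicted cluster labels.
y_true = np.array([1, 1, 2, 2, 3, 3, 3, 1])
y_pred = np.array([1, 2, 2, 2, 3, 1, 3, 1])

cm = confusion_matrix(y_true, y_pred)   # rows = true cluster, cols = predicted
acc = accuracy_score(y_true, y_pred)    # overall fraction classified correctly
print(cm)
print(acc)
```

Here the diagonal of cm sums to the number of correct predictions, and acc is that count divided by the total number of observations.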

We have chosen to utilize Stratified K-Fold cross validation for our classification analysis, with 10 folds. This means that from our original sample size of 1000, each "fold" holds out approximately 10% of observations as a test set, using the rest for training, all while keeping the class ratio equal among clusters. This process runs for 10 iterations, or folds, allowing us to cross validate our results across different test/train combinations. We have used a random_state seed equal to the length of the original sampled data set to ensure reproducible results.
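A minimal sketch of the stratification guarantee on toy labels (not our CitiBike sample): with 100 observations split 70/30 between two classes, every 10-fold test set keeps that same 70/30 ratio:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 70 + [1] * 30)  # toy labels with a 70/30 class imbalance
X = np.zeros((100, 2))             # features do not affect how folds are drawn

skf = StratifiedKFold(n_splits=10)
fold_ratios = []
for train_idx, test_idx in skf.split(X, y):
    # each 10-observation test fold contains 7 of class 0 and 3 of class 1
    fold_ratios.append((y[test_idx] == 1).mean())

print(fold_ratios)  # every fold preserves the 30% minority share
```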

In [123]:
fullunpicklepath = "PickleFiles\Sub_clus.pkl"
# Open the pickle file in binary read mode and restore the subscriber clusters
unpicklefile = open(fullunpicklepath, 'rb')
CitiBike_S_Clust = pickle.load(unpicklefile)
unpicklefile.close()

fullunpicklepath = "PickleFiles\Cust_clus.pkl"
# Open the pickle file in binary read mode and restore the customer clusters
unpicklefile = open(fullunpicklepath, 'rb')
CitiBike_C_Clust = pickle.load(unpicklefile)
unpicklefile.close()

# Create CV object for StratifiedKFold with 10 folds, seeded at the length of our sample
seed = len(CitiBike_S_Clust)

cv = StratifiedKFold(n_splits = 10, random_state = seed)
print(cv)
StratifiedKFold(n_splits=10, random_state=1000, shuffle=False)

Random Forest Classification

Max Depth The maximum depth (levels) of the tree. When a value is set, the tree may not split further once this level has been reached, regardless of how many samples remain in the leaf.

Max Features The number of features to consider when looking for the best split. Values tested ranged from the default up to the maximum number of features passed in.

Minimum Samples in Leaf The minimum number of samples required in a leaf node. Splits that would leave a leaf with fewer samples than this value are not made. Too low a value here leads to overfitting the tree to the training data.

Minimum Samples to Split The minimum number of samples required to split a node. Care was taken during parameter tests to keep the ratio between Min Samples in Leaf and Min Samples to Split equal to that of the default values (1:2), allowing an even 50/50 split on nodes that match the lowest-granularity split criterion. As with Min Samples in Leaf, too low a value here leads to overfitting the tree to the training data.

n_estimators The number of trees generated in the forest. Increasing the number of trees in our models increased accuracy while decreasing performance; we tuned so that all 10 iterations completed in under 10 minutes.
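To illustrate why too small a leaf size overfits, the sketch below (on synthetic data, not our CitiBike sample) compares training accuracy for 1-sample leaves against 20-sample leaves, keeping the 1:2 leaf/split ratio noted above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(300, 5)                                     # synthetic features
y = (X[:, 0] + 0.3 * rng.randn(300) > 0.5).astype(int)   # noisy binary target

# Fully grown trees with 1-sample leaves can memorize the training noise...
deep = RandomForestClassifier(n_estimators=15, min_samples_leaf=1,
                              min_samples_split=2, random_state=0).fit(X, y)
# ...while larger leaves (1:2 leaf/split ratio preserved) constrain the trees.
shallow = RandomForestClassifier(n_estimators=15, min_samples_leaf=20,
                                 min_samples_split=40, random_state=0).fit(X, y)

print(deep.score(X, y), shallow.score(X, y))  # training accuracies
```

The deep forest scores near-perfectly on its own training data, while the constrained forest cannot fit the label noise; the gap is the overfitting these parameters guard against.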

In [124]:
%%time


def rfc_explor(ScaledData,
               n_estimators,
               max_features,
               max_depth, 
               min_samples_split,
               min_samples_leaf,
               y,
               cv          = cv,
               seed        = seed):
    startTime = datetime.now()

    X = ScaledData
    
    rfc_clf = RandomForestClassifier(n_estimators      = n_estimators,
                                     max_features      = max_features,
                                     max_depth         = max_depth,
                                     min_samples_split = min_samples_split,
                                     min_samples_leaf  = min_samples_leaf,
                                     n_jobs            = -1,    # use all cores
                                     random_state      = seed)  # build the classifier object
    
    accuracy = cross_val_score(rfc_clf, X, y, cv=cv.split(X, y)) # this also can help with parallelism
    MeanAccuracy =  sum(accuracy)/len(accuracy)
    accuracy = np.append(accuracy, MeanAccuracy)
    endTime = datetime.now()
    TotalTime = endTime - startTime
    accuracy = np.append(accuracy, TotalTime)
    
    #print(TotalTime)
    #print(accuracy)
    
    return accuracy
Wall time: 0 ns
In [125]:
def plot_confusion_matrix(cm, 
                          classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    plt.rcParams['figure.figsize'] = (12, 6)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    

    thresh = cm.max() / 2.
    
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, round(cm[i, j],4),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

    plt.show()
In [126]:
%%time

def compute_kfold_scores_Classification( clf,  
                                         ScaledData,
                                         y,
                                         classes,
                                         cv       = cv):
    

    X = ScaledData.values  # (formerly .as_matrix(), deprecated in newer pandas)


    # Run classifier with cross-validation

    accuracy = []
    
    for (train, test) in cv.split(X, y):
        clf.fit(X[train],y[train])  # train object
        y_hat = clf.predict(X[test]) # get test set predictions
        
        
        a = float(mt.accuracy_score(y[test],y_hat))
       
        accuracy.append(round(a,5)) 

   
    print("Accuracy Ratings across all iterations: {0}\n\n\
Average Accuracy: {1}\n".format(accuracy, round(sum(accuracy)/len(accuracy),5)))

    print("confusion matrix\n{0}\n".format(pd.crosstab(y[test], y_hat, rownames = ['True'], colnames = ['Predicted'], margins = True)))   

        # Plot non-normalized confusion matrix
    plt.figure()
    plot_confusion_matrix(confusion_matrix(y[test], y_hat), 
                          classes   =classes, 
                          normalize =True,
                          title     ='Confusion matrix, with normalization')
    
    return clf, accuracy
Wall time: 0 ns

Train Cluster Parameters for Subscribers

After 14 iterations of modifying the above parameters, we land on a final winner based on the highest average accuracy across all iterations. Average accuracy values across our 10 test/train iterations ranged from 87.6965% with the worst parameter inputs to 89.3968% with the best-tuned model fit. Also shown below is a line chart of all accuracies across the 10 iterations, with one line per parameter input. For all inputs provided, we saw less consistency than desired: with our winning parameter inputs, accuracy ranged from 86% to 94.0594%. Part of this problem could be due to the very small sample size fit through spectral clustering; with more computational resources and time, we could have fit a larger data set to these clusters for training and testing. Given this report's focus on clustering rather than classification, we have chosen to train only one model type. In a real-world setting, we would test various other classification methodologies to identify the model with the best, most consistent fit.

Parameter inputs for the final Random Forest Classification model are as follows:

Subscriber Parameters

max_depth max_features min_samples_leaf min_samples_split n_estimators
None auto 1 2 15
In [127]:
%%time
#os.remove("PickleFiles/rfcdf_Subsc.pkl")

if os.path.isfile("PickleFiles/rfcdf_Subsc.pkl"):
    print("File already created")
    rfcdf_Subsc = unpickleObject("rfcdf_Subsc")
else: 
    acclist = [] 

    n_estimators       =  [10    , 10    , 10   , 10  , 10   , 10  , 10  , 10      , 10     , 10     , 10    , 5       , 15      , 25     ]  
    max_features       =  ['auto', 'auto', 8    , 12  , 16   , 20  , 22  , 'auto'  , 'auto' , 'auto' , 'auto', 'auto'  , 'auto'  , 'auto' ] 
    max_depth          =  [None  , None  , None , None, None , None, None, 100     , 50     , 25     , 10    , None    , None    , None   ] 
    min_samples_split  =  [2     , 8     , 2    , 2   , 2    , 2   , 2   , 2       , 2      , 2      , 2     , 2       , 2       , 2      ] 
    min_samples_leaf   =  [1     , 4     , 1    , 1   , 1    , 1   , 1   , 1       , 1      , 1      , 1     , 1       , 1       , 1      ]

    for i in range(0,len(n_estimators)):
        acclist.append(rfc_explor(ScaledData        = CitiBike_S_Clust.drop(["usertype","Cluster_ID_Customer", "Cluster_ID_Subscriber"], axis=1),
                                  n_estimators      = n_estimators[i],
                                  max_features      = max_features[i],
                                  max_depth         = max_depth[i],
                                  min_samples_split = min_samples_split[i],
                                  min_samples_leaf  = min_samples_leaf[i],
                                  y = CitiBike_S_Clust["Cluster_ID_Subscriber"].values
                                 )
                      )

    rfcdf_Subsc = pd.DataFrame(pd.concat([pd.DataFrame({
                                                    "n_estimators": n_estimators,          
                                                    "max_features": max_features,         
                                                    "max_depth": max_depth,        
                                                    "min_samples_split": min_samples_split,
                                                    "min_samples_leaf": min_samples_leaf   
                                                  }),
                                   pd.DataFrame(acclist)], axis = 1).reindex())
    rfcdf_Subsc.columns = ['max_depth', 'max_features', 'min_samples_leaf','min_samples_split', 'n_estimators', 'Iteration 0', 'Iteration 1', 'Iteration 2', 'Iteration 3', 'Iteration 4', 'Iteration 5', 'Iteration 6', 'Iteration 7', 'Iteration 8', 'Iteration 9', 'MeanAccuracy', 'RunTime']

    pickleObject(rfcdf_Subsc, "rfcdf_Subsc", filepath = "PickleFiles/")

display(rfcdf_Subsc)


plot = rfcdf_Subsc[["Iteration 0","Iteration 1","Iteration 2","Iteration 3","Iteration 4","Iteration 5","Iteration 6","Iteration 7","Iteration 8","Iteration 9"]].transpose().plot.line(title = "Accuracies for Varying Parameters Across 10 Iterations",rot=45)
plot.set_xlabel("Iterations")
plot.set_ylabel("Accuracies")
File already created
max_depth max_features min_samples_leaf min_samples_split n_estimators Iteration 0 Iteration 1 Iteration 2 Iteration 3 Iteration 4 Iteration 5 Iteration 6 Iteration 7 Iteration 8 Iteration 9 MeanAccuracy RunTime
0 NaN auto 1 2 10 0.891089 0.831683 0.940594 0.89 0.86 0.85 0.88 0.91 0.90 0.907216 0.886058 00:00:02.260028
1 NaN auto 4 8 10 0.891089 0.851485 0.920792 0.86 0.86 0.86 0.88 0.86 0.91 0.876289 0.876965 00:00:02.311024
2 NaN 8 1 2 10 0.891089 0.851485 0.950495 0.91 0.89 0.85 0.86 0.89 0.92 0.907216 0.892029 00:00:02.204510
3 NaN 12 1 2 10 0.871287 0.831683 0.940594 0.95 0.87 0.83 0.89 0.86 0.93 0.896907 0.887047 00:00:02.199601
4 NaN 16 1 2 10 0.891089 0.851485 0.940594 0.93 0.85 0.89 0.87 0.88 0.90 0.886598 0.888977 00:00:02.216635
5 NaN 20 1 2 10 0.910891 0.851485 0.960396 0.92 0.88 0.82 0.89 0.92 0.91 0.886598 0.894937 00:00:02.345323
6 NaN 22 1 2 10 0.871287 0.841584 0.960396 0.94 0.86 0.85 0.88 0.92 0.91 0.886598 0.891987 00:00:02.241465
7 100.0 auto 1 2 10 0.891089 0.831683 0.940594 0.89 0.86 0.85 0.88 0.91 0.90 0.907216 0.886058 00:00:02.300307
8 50.0 auto 1 2 10 0.891089 0.831683 0.940594 0.89 0.86 0.85 0.88 0.91 0.90 0.907216 0.886058 00:00:02.295318
9 25.0 auto 1 2 10 0.891089 0.831683 0.940594 0.89 0.86 0.85 0.88 0.91 0.90 0.907216 0.886058 00:00:02.270715
10 10.0 auto 1 2 10 0.910891 0.861386 0.950495 0.91 0.87 0.81 0.91 0.90 0.90 0.907216 0.892999 00:00:02.243954
11 NaN auto 1 2 5 0.831683 0.821782 0.881188 0.86 0.84 0.86 0.86 0.88 0.89 0.886598 0.861125 00:00:02.219607
12 NaN auto 1 2 15 0.940594 0.871287 0.930693 0.91 0.89 0.86 0.88 0.91 0.89 0.876289 0.895886 00:00:02.347737
13 NaN auto 1 2 25 0.930693 0.861386 0.930693 0.92 0.91 0.84 0.88 0.88 0.89 0.896907 0.893968 00:00:02.484979
Wall time: 2.45 s

Train Cluster Parameters for Customers

After 15 iterations of modifying the above parameters, we land on a final winner based on the highest average accuracy across all iterations. Average accuracy values across our 10 test/train iterations ranged from 66.5054% with the worst parameter inputs to 73.7131% with the best-tuned model fit. Also shown below is a line chart of all accuracies across the 10 iterations, with one line per parameter input. For all inputs provided, we saw less consistency than desired: with our winning parameter inputs, accuracy ranged from 63.7255% to 81.1881%. Part of this problem could be due to the very small sample size fit through spectral clustering; with more computational resources and time, we could have fit a larger data set to these clusters for training and testing. Given this report's focus on clustering rather than classification, we have chosen to train only one model type. In a real-world setting, we would test various other classification methodologies to identify the model with the best, most consistent fit.

Parameter inputs for the final Random Forest Classification model are as follows:

Customer Parameters

max_depth max_features min_samples_leaf min_samples_split n_estimators
None auto 4 8 25
In [128]:
%%time
#os.remove("PickleFiles/rfcdf_Cust.pkl")

if os.path.isfile("PickleFiles/rfcdf_Cust.pkl"):
    print("File already created")
    rfcdf_Cust = unpickleObject("rfcdf_Cust")
else: 
    acclist = [] 

    n_estimators       =  [10    , 10    , 10    , 10   , 10  , 10   , 10  , 10  , 10      , 10    , 10    , 10    , 5     , 15    , 25     ]  
    max_features       =  ['auto', 'auto', 'auto', 8    , 12  , 16   , 20  , 22  , 'auto'  , 'auto', 'auto', 'auto', 'auto', 'auto', 'auto' ] 
    max_depth          =  [None  , None  , None  , None , None, None , None, None, 100     , 50    , 25    , 10    , None  , None  , None   ] 
    min_samples_split  =  [2     , 8     , 12    , 8    , 8   , 8    , 8   , 8   , 8       , 8     , 8     , 8     , 8     , 8     , 8      ] 
    min_samples_leaf   =  [1     , 4     , 6     , 4    , 4   , 4    , 4   , 4   , 4       , 4     , 4     , 4     , 4     , 4     , 4      ]

    for i in range(0,len(n_estimators)):
        acclist.append(rfc_explor(ScaledData        = CitiBike_C_Clust.drop(["usertype","Cluster_ID_Customer", "Cluster_ID_Subscriber"], axis=1),
                                  n_estimators      = n_estimators[i],
                                  max_features      = max_features[i],
                                  max_depth         = max_depth[i],
                                  min_samples_split = min_samples_split[i],
                                  min_samples_leaf  = min_samples_leaf[i],
                                  y = CitiBike_C_Clust["Cluster_ID_Customer"].values
                                 )
                      )

    rfcdf_Cust = pd.DataFrame(pd.concat([pd.DataFrame({
                                                    "n_estimators": n_estimators,          
                                                    "max_features": max_features,         
                                                    "max_depth": max_depth,        
                                                    "min_samples_split": min_samples_split,
                                                    "min_samples_leaf": min_samples_leaf   
                                                  }),
                                   pd.DataFrame(acclist)], axis = 1).reindex())
    rfcdf_Cust.columns = ['max_depth', 'max_features', 'min_samples_leaf','min_samples_split', 'n_estimators', 'Iteration 0', 'Iteration 1', 'Iteration 2', 'Iteration 3', 'Iteration 4', 'Iteration 5', 'Iteration 6', 'Iteration 7', 'Iteration 8', 'Iteration 9', 'MeanAccuracy', 'RunTime']

    pickleObject(rfcdf_Cust, "rfcdf_Cust", filepath = "PickleFiles/")
    
display(rfcdf_Cust)

plot = rfcdf_Cust[["Iteration 0","Iteration 1","Iteration 2","Iteration 3","Iteration 4","Iteration 5","Iteration 6","Iteration 7","Iteration 8","Iteration 9"]].transpose().plot.line(title = "Accuracies for Varying Parameters Across 10 Iterations",rot=45)
plot.set_xlabel("Iterations")
plot.set_ylabel("Accuracies")
File already created
max_depth max_features min_samples_leaf min_samples_split n_estimators Iteration 0 Iteration 1 Iteration 2 Iteration 3 Iteration 4 Iteration 5 Iteration 6 Iteration 7 Iteration 8 Iteration 9 MeanAccuracy RunTime
0 NaN auto 1 2 10 0.656863 0.752475 0.712871 0.68 0.69 0.69 0.747475 0.727273 0.767677 0.666667 0.709130 00:00:02.303866
1 NaN auto 4 8 10 0.637255 0.772277 0.693069 0.69 0.68 0.63 0.717172 0.646465 0.787879 0.727273 0.698139 00:00:02.211724
2 NaN auto 6 12 10 0.598039 0.712871 0.663366 0.69 0.70 0.63 0.727273 0.757576 0.757576 0.595960 0.683266 00:00:02.237937
3 NaN 8 4 8 10 0.647059 0.722772 0.702970 0.78 0.70 0.71 0.757576 0.727273 0.717172 0.686869 0.715169 00:00:02.270112
4 NaN 12 4 8 10 0.666667 0.762376 0.712871 0.72 0.73 0.67 0.797980 0.727273 0.727273 0.686869 0.720131 00:00:02.272313
5 NaN 16 4 8 10 0.637255 0.722772 0.752475 0.71 0.72 0.67 0.777778 0.818182 0.777778 0.737374 0.732361 00:00:02.286604
6 NaN 20 4 8 10 0.666667 0.732673 0.762376 0.71 0.67 0.68 0.797980 0.797980 0.797980 0.686869 0.730252 00:00:02.300785
7 NaN 22 4 8 10 0.637255 0.702970 0.722772 0.74 0.70 0.67 0.787879 0.797980 0.767677 0.717172 0.724370 00:00:02.298785
8 100.0 auto 4 8 10 0.637255 0.772277 0.693069 0.69 0.68 0.63 0.717172 0.646465 0.787879 0.727273 0.698139 00:00:02.260422
9 50.0 auto 4 8 10 0.637255 0.772277 0.693069 0.69 0.68 0.63 0.717172 0.646465 0.787879 0.727273 0.698139 00:00:02.271674
10 25.0 auto 4 8 10 0.637255 0.772277 0.693069 0.69 0.68 0.63 0.717172 0.646465 0.787879 0.727273 0.698139 00:00:02.311992
11 10.0 auto 4 8 10 0.617647 0.762376 0.752475 0.66 0.70 0.67 0.757576 0.707071 0.787879 0.686869 0.710189 00:00:02.220918
12 NaN auto 4 8 5 0.578431 0.772277 0.643564 0.68 0.71 0.64 0.676768 0.555556 0.676768 0.717172 0.665054 00:00:02.138857
13 NaN auto 4 8 15 0.627451 0.752475 0.792079 0.73 0.72 0.65 0.707071 0.696970 0.737374 0.747475 0.716089 00:00:02.344371
14 NaN auto 4 8 25 0.637255 0.811881 0.762376 0.75 0.73 0.70 0.757576 0.737374 0.747475 0.737374 0.737131 00:00:02.425082
Wall time: 197 ms
Random Forest - Analyze the Results

We have created a function for our cross-validation accuracy scores. Given a model CLF object, the original sample y values, a distinct list of classes, and a CV object containing our test/train splits, it produces an array of accuracy scores. Finally, a confusion matrix is displayed for the last test/train iteration for further interpretation of the results.
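The helper itself (`compute_kfold_scores_Classification`) is defined earlier in the notebook; as a rough illustration of the idea, a minimal sketch under an assumed name and signature might look like this:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

def kfold_scores_sketch(clf, X, y, n_splits=10, seed=0):
    """Fit clf on each training fold and score accuracy on the matching
    test fold; return the fitted clf, the per-fold accuracies, and a
    confusion matrix for the final fold (margins mimic the 'All'
    row/column in the report's output)."""
    accs = []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(X, y):
        clf.fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[test_idx])
        accs.append(accuracy_score(y[test_idx], pred))
    # confusion matrix for the last test/train iteration only
    cm = pd.crosstab(pd.Series(y[test_idx], name="True"),
                     pd.Series(pred, name="Predicted"), margins=True)
    return clf, np.array(accs), cm
```

The returned accuracy array is what feeds the "Accuracy Ratings across all iterations" lines printed below each model.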

With our tuned parameters identified, we may now assess further insights. Subscribers: As discussed earlier, we see a fairly large range in accuracies; however, the average accuracy of 89.589% across all 10 iterations is fairly decent. In the confusion matrix of predicted results, cluster 4 is both the most difficult cluster to predict and the cluster causing the most incorrect predictions for other classes. This is clear in the gradient plot below, where color density is greater for cluster 4 false positives and incorrect cluster 4 predictions than for other clusters. Customers: The customer demographic had much less accurate predictions despite the smaller number of clusters. We once again see a large range in accuracies, with an average accuracy of 73.713%. As seen in the confusion matrix, each cluster is predicted correctly or incorrectly at fairly consistent rates, and unlike the subscriber cluster set we do not see any obvious trend in the incorrect predictions off the bat. Although ~73% accuracy is not ideal, it is still more than twice as good as chance (~33% with three clusters), so we are comfortable moving forward with this model for this report.

Once again, in a real-world setting, we would test various other classification methodologies and utilize more resources to increase sample size in order to identify stronger cluster fits and a model with the best consistent fit.
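The "gradient plot" of the confusion matrix referred to above is a row-normalized heatmap; a minimal sketch of producing one (the function name and styling are our own, not the notebook's):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

def normalized_confusion_heatmap(cm, classes):
    """Row-normalize a confusion matrix (each true class sums to 1) and
    render it as a heatmap; denser off-diagonal cells flag the clusters
    that are confused most often."""
    cm = np.asarray(cm, dtype=float)
    cm_norm = cm / cm.sum(axis=1, keepdims=True)
    fig, ax = plt.subplots()
    im = ax.imshow(cm_norm, cmap="Blues")
    fig.colorbar(im)
    ax.set_xticks(range(len(classes))); ax.set_xticklabels(classes)
    ax.set_yticks(range(len(classes))); ax.set_yticklabels(classes)
    ax.set_xlabel("Predicted"); ax.set_ylabel("True")
    plt.close(fig)  # drop this line to display the figure in a notebook
    return cm_norm
```

Applied to the subscriber matrix below, row 4 spreads its mass across clusters 1 and 3, which is exactly the density the text calls out.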

In [129]:
%%time

rfc_clf = RandomForestClassifier(n_estimators       = 15    ,
                                 max_features       = 'auto',
                                 max_depth          = None  , 
                                 min_samples_split  = 2     ,
                                 min_samples_leaf   = 1     ,
                                 n_jobs             = -1    , 
                                 random_state       = seed) # get object
    
rfc_clf_Subscriber, rfc_acc_Subscriber = compute_kfold_scores_Classification(clf         = rfc_clf,
                                                       ScaledData  = CitiBike_S_Clust.drop(["usertype","Cluster_ID_Customer", "Cluster_ID_Subscriber"], axis=1),
                                                       y           = CitiBike_S_Clust["Cluster_ID_Subscriber"].values,
                                                       classes     = CitiBike_S_Clust["Cluster_ID_Subscriber"].drop_duplicates().sort_values().values
                                                      )
Accuracy Ratings across all iterations: [0.94059, 0.87129, 0.93069, 0.91, 0.89, 0.86, 0.88, 0.91, 0.89, 0.87629]

Average Accuracy: 0.89589

confusion matrix
Predicted   1  2   3   4   5  All
True                             
1          18  0   0   1   0   19
2           0  7   0   0   0    7
3           0  0  25   4   0   29
4           4  0   3  22   0   29
5           0  0   0   0  13   13
All        22  7  28  27  13   97

Normalized confusion matrix
Wall time: 4.07 s
In [130]:
%%time

rfc_clf = RandomForestClassifier(n_estimators       = 25    ,
                                 max_features       = 'auto',
                                 max_depth          = None  , 
                                 min_samples_split  = 8     ,
                                 min_samples_leaf   = 4     ,
                                 n_jobs             = -1    , 
                                 random_state       = seed) # get object
    
rfc_clf_Customer, rfc_acc_Customer = compute_kfold_scores_Classification(clf         = rfc_clf,
                                                       ScaledData  = CitiBike_C_Clust.drop(["usertype","Cluster_ID_Customer", "Cluster_ID_Subscriber"], axis=1),
                                                       y           = CitiBike_C_Clust["Cluster_ID_Customer"].values,
                                                       classes     = CitiBike_C_Clust["Cluster_ID_Customer"].drop_duplicates().sort_values().values
                                                      )
Accuracy Ratings across all iterations: [0.63725, 0.81188, 0.76238, 0.75, 0.73, 0.7, 0.75758, 0.73737, 0.74747, 0.73737]

Average Accuracy: 0.73713

confusion matrix
Predicted   1   2   3  All
True                      
1          18   8   0   26
2           4  33   7   44
3           0   7  22   29
All        22  48  29   99

Normalized confusion matrix
Wall time: 2.68 s
Random Forest - Predict Cluster Values for Inverse UserType Observations
In [131]:
CitiBike_WithClusters = pd.concat([CitiBike_C, CitiBike_S])
X = CitiBike_WithClusters.drop(["usertype"], axis=1)

CitiBike_WithClusters["Cluster_ID_Customer"]   = rfc_clf_Customer.predict(X)
CitiBike_WithClusters["Cluster_ID_Subscriber"] = rfc_clf_Subscriber.predict(X)

display(CitiBike_WithClusters.head())
display(CitiBike_WithClusters.tail())
start_station_latitude start_station_longitude end_station_latitude end_station_longitude PRCP SNOW TAVE TMAX TMIN DayOfWeek_Friday DayOfWeek_Monday DayOfWeek_Saturday DayOfWeek_Sunday DayOfWeek_Thursday DayOfWeek_Tuesday DayOfWeek_Wednesday TimeOfDay_Afternoon TimeOfDay_Evening TimeOfDay_Midday TimeOfDay_Morning TimeOfDay_Night tripdurationLog usertype Cluster_ID_Customer Cluster_ID_Subscriber
0 -1.945869 1.118504 -0.866803 2.396808 -0.305915 -0.069606 1.285221 1.258632 1.319126 -0.399214 -0.385424 -0.472244 2.267759 -0.381894 -0.385824 -0.388611 1.845237 -0.694351 -0.544202 -0.407459 -0.288457 1.187103 Customer 1 4
1 0.427429 -0.453045 0.585415 -0.808549 -0.305915 -0.069606 -0.464291 -0.631718 -0.316163 -0.399214 -0.385424 -0.472244 -0.440964 -0.381894 -0.385824 2.573270 -0.541936 1.440195 -0.544202 -0.407459 -0.288457 -0.966816 Customer 2 3
2 -1.217379 -1.997279 -1.051587 -1.867364 -0.305915 -0.069606 0.961237 0.754539 1.122892 -0.399214 -0.385424 -0.472244 2.267759 -0.381894 -0.385824 -0.388611 -0.541936 1.440195 -0.544202 -0.407459 -0.288457 -0.314528 Customer 3 4
3 1.443002 1.104047 1.270051 1.338197 4.087822 -0.069606 0.637254 0.313457 0.926657 -0.399214 2.594548 -0.472244 -0.440964 -0.381894 -0.385824 -0.388611 -0.541936 1.440195 -0.544202 -0.407459 -0.288457 -1.159576 Customer 1 3
4 0.383217 0.639000 0.391566 0.632595 0.007923 -0.069606 0.507660 0.376469 0.665011 -0.399214 -0.385424 2.117550 -0.440964 -0.381894 -0.385824 -0.388611 -0.541936 -0.694351 1.837552 -0.407459 -0.288457 -1.901593 Customer 3 4
start_station_latitude start_station_longitude end_station_latitude end_station_longitude PRCP SNOW TAVE TMAX TMIN DayOfWeek_Friday DayOfWeek_Monday DayOfWeek_Saturday DayOfWeek_Sunday DayOfWeek_Thursday DayOfWeek_Tuesday DayOfWeek_Wednesday TimeOfDay_Afternoon TimeOfDay_Evening TimeOfDay_Midday TimeOfDay_Morning TimeOfDay_Night tripdurationLog usertype Cluster_ID_Customer Cluster_ID_Subscriber
499995 0.894884 0.761537 1.293739 2.047810 0.112536 -0.069606 0.766847 0.628516 0.926657 -0.399214 -0.385424 -0.472244 -0.440964 -0.381894 2.591856 -0.388611 -0.541936 1.440195 -0.544202 -0.407459 -0.288457 -0.121459 Subscriber 2 3
499996 0.087561 -1.041187 -0.340209 0.127857 -0.305915 -0.069606 -1.047461 -1.135811 -0.970279 -0.399214 -0.385424 -0.472244 2.267759 -0.381894 -0.385824 -0.388611 -0.541936 1.440195 -0.544202 -0.407459 -0.288457 0.799127 Subscriber 2 3
499997 1.501576 0.274900 1.043263 0.868015 -0.305915 -0.069606 0.507660 0.502492 0.468776 -0.399214 -0.385424 2.117550 -0.440964 -0.381894 -0.385824 -0.388611 -0.541936 -0.694351 -0.544202 2.454233 -0.288457 -0.796115 Subscriber 1 5
499998 0.411847 -1.010167 0.823976 -0.258512 -0.253609 -0.069606 -0.464291 -0.505695 -0.381574 -0.399214 -0.385424 2.117550 -0.440964 -0.381894 -0.385824 -0.388611 -0.541936 -0.694351 -0.544202 -0.407459 3.466716 -0.381556 Subscriber 1 4
499999 -1.856080 -0.351114 -2.122966 0.177918 -0.253609 -0.069606 0.054083 -0.253648 0.403364 -0.399214 -0.385424 -0.472244 -0.440964 2.618529 -0.385824 -0.388611 -0.541936 -0.694351 -0.544202 2.454233 -0.288457 -1.261533 Subscriber 1 5
PCA Loadings: do our newly added Cluster features add value?

Our first objective is to identify the number of components to be used for user type classification. To do so, we first exclude redundant and non-value variables up front. Non-value variables include gender, birth year, and age, since these data were missing for most Customer user types and were replaced with filler values, as discussed in previous sections. We exclude them because they would misrepresent correlation with user type.

(Note: PCA code steps adapted and modified from https://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/)

In [132]:
%%time
    #Split Cluster_ID_Customer Values
AttSplit = pd.get_dummies(CitiBike_WithClusters.Cluster_ID_Customer,prefix='Cluster_ID_Customer')
CitiBike_WithClusters = pd.concat((CitiBike_WithClusters,AttSplit),axis=1) # add back into the dataframe

    #Split Cluster_ID_Subscriber Values 
AttSplit = pd.get_dummies(CitiBike_WithClusters.Cluster_ID_Subscriber,prefix='Cluster_ID_Subscriber')
CitiBike_WithClusters = pd.concat((CitiBike_WithClusters,AttSplit),axis=1) # add back into the dataframe

display(CitiBike_WithClusters.head())

myData_scaled_classification = CitiBike_WithClusters.drop(["usertype", "Cluster_ID_Customer", "Cluster_ID_Subscriber"], axis = 1).as_matrix()
print(myData_scaled_classification.shape)

maxcomp = 30

pca_class = PCA(n_components=maxcomp, svd_solver='randomized')

pca_class.fit(myData_scaled_classification)
start_station_latitude start_station_longitude end_station_latitude end_station_longitude PRCP SNOW TAVE TMAX TMIN DayOfWeek_Friday DayOfWeek_Monday DayOfWeek_Saturday DayOfWeek_Sunday DayOfWeek_Thursday DayOfWeek_Tuesday DayOfWeek_Wednesday TimeOfDay_Afternoon TimeOfDay_Evening TimeOfDay_Midday TimeOfDay_Morning TimeOfDay_Night tripdurationLog usertype Cluster_ID_Customer Cluster_ID_Subscriber Cluster_ID_Customer_1 Cluster_ID_Customer_2 Cluster_ID_Customer_3 Cluster_ID_Subscriber_1 Cluster_ID_Subscriber_2 Cluster_ID_Subscriber_3 Cluster_ID_Subscriber_4 Cluster_ID_Subscriber_5
0 -1.945869 1.118504 -0.866803 2.396808 -0.305915 -0.069606 1.285221 1.258632 1.319126 -0.399214 -0.385424 -0.472244 2.267759 -0.381894 -0.385824 -0.388611 1.845237 -0.694351 -0.544202 -0.407459 -0.288457 1.187103 Customer 1 4 1 0 0 0 0 0 1 0
1 0.427429 -0.453045 0.585415 -0.808549 -0.305915 -0.069606 -0.464291 -0.631718 -0.316163 -0.399214 -0.385424 -0.472244 -0.440964 -0.381894 -0.385824 2.573270 -0.541936 1.440195 -0.544202 -0.407459 -0.288457 -0.966816 Customer 2 3 0 1 0 0 0 1 0 0
2 -1.217379 -1.997279 -1.051587 -1.867364 -0.305915 -0.069606 0.961237 0.754539 1.122892 -0.399214 -0.385424 -0.472244 2.267759 -0.381894 -0.385824 -0.388611 -0.541936 1.440195 -0.544202 -0.407459 -0.288457 -0.314528 Customer 3 4 0 0 1 0 0 0 1 0
3 1.443002 1.104047 1.270051 1.338197 4.087822 -0.069606 0.637254 0.313457 0.926657 -0.399214 2.594548 -0.472244 -0.440964 -0.381894 -0.385824 -0.388611 -0.541936 1.440195 -0.544202 -0.407459 -0.288457 -1.159576 Customer 1 3 1 0 0 0 0 1 0 0
4 0.383217 0.639000 0.391566 0.632595 0.007923 -0.069606 0.507660 0.376469 0.665011 -0.399214 -0.385424 2.117550 -0.440964 -0.381894 -0.385824 -0.388611 -0.541936 -0.694351 1.837552 -0.407459 -0.288457 -1.901593 Customer 3 4 0 0 1 0 0 0 1 0
(500000, 30)
Wall time: 12.9 s

Above, we verify that 30 attributes exist in our dataset after removing the features excluded for clustering and adding our cluster values. The maximum number of components that can be produced matches this number, so we set our PCA to produce 30 components and will review each component's explained variance to determine the proper number of components to include later during model generation. Note that randomized PCA was chosen so we could use approximate singular value decomposition in our dimensionality reduction, given the large size of our data set; full PCA required unacceptable lengths of time to compute.

Below, the resulting components have been ordered by eigenvalue, and these values are portrayed as ratios of variance explained by each component. To identify the principal components to include during model generation, we review the rate at which explained variance decreases from one principal component to the next. Accompanying these proportion values is a scree plot representing the same values in visual form, which makes it easier to judge where the rate of decrease levels off. Interestingly, the rate of change in explained variance has two "stairs" in the scree plot. This makes choosing the number of components slightly more challenging, as one could argue for either ~PC 5 or ~PC 17 as the point at which the plot levels off to marginal gains in explained variance.

In [133]:
#The amount of variance that each PC explains
var= pca_class.explained_variance_ratio_

plt.rcParams['figure.figsize'] = (12, 6)

sns.set(font_scale=1)
plt.plot(range(1,maxcomp+1), var*100, marker = '.', color = 'red', markerfacecolor = 'black')
plt.xlabel('Principal Components')
plt.ylabel('Percentage of Explained Variance')
plt.title('Scree Plot')
plt.axis([0, maxcomp+1, -0.1, 20])

np.set_printoptions(suppress=True)
print(np.round(var, decimals=4)*100)
[ 14.13   8.46   6.92   6.49   5.59   5.43   5.26   5.21   4.99   4.95
   4.92   4.71   4.51   4.41   3.82   3.37   2.58   1.41   1.05   0.68
   0.52   0.27   0.16   0.13   0.03   0.     0.     0.     0.     0.  ]

Referring now to the cumulative variance values and associated plot below, the cumulative variance arguably begins to plateau around the 17th principal component, and the first 17 components together explain 95.75% of the variance in the data set. For this reason, 17 principal components may be selected as most appropriate for user type classification modeling given the variables among these data.
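The cutoff can also be computed programmatically from `explained_variance_ratio_`; a small sketch in which the 0.95 threshold is our choice, picked to match the 95.75% figure:

```python
import numpy as np

def n_components_for(var_ratios, threshold=0.95):
    """Smallest number of leading components whose cumulative
    explained variance reaches the threshold."""
    cum = np.cumsum(var_ratios)
    return int(np.argmax(cum >= threshold)) + 1
```

Feeding in the ratios printed above (as fractions) with a 0.95 threshold reproduces the choice of 17 components.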

In [134]:
#Cumulative Variance explains
var1=np.cumsum(np.round(pca_class.explained_variance_ratio_, decimals=4)*100)

plt.rcParams['figure.figsize'] = (12, 6)

plt.plot(range(1,maxcomp+1), var1, marker = '.', color = 'green', markerfacecolor = 'black')
plt.xlabel('Principal Components')
plt.ylabel('Explained Variance (Sum %)')
plt.title('Cumulative Variance Plot')
plt.axis([0, maxcomp+1, 10, 101])

print(var1)
[ 14.13  22.59  29.51  36.    41.59  47.02  52.28  57.49  62.48  67.43
  72.35  77.06  81.57  85.98  89.8   93.17  95.75  97.16  98.21  98.89
  99.41  99.68  99.84  99.97 100.   100.   100.   100.   100.   100.  ]
Classification PCA Loadings

Now at the individual principal component level, each classification principal component has 30 loadings, which may be interpreted as weights or coefficients representing each original attribute. These loadings represent the extent to which each attribute contributes to a given principal component, and the relationships between attributes in the context of the component under review. Another perspective is that each principal component describes an underlying factor comprised of its heaviest loadings.

Rather than discuss all 30 principal components, we will instead focus on the two principal components, among the first 17, with what we perceive as the most interesting loadings with respect to cluster values (PC1 and PC4). Each plot below identifies the top 15 attributes, both positive and negative, that describe the component analyzed.

Principal Component 1 - Early Birds, Weekends, Mornings

We were happy to see 6 of the top 15 attributes within the first principal component identified by our newly added customer/subscriber clusters. Interestingly, weather plays a big part in the relationships with these cluster values in PC1. With temperature loadings of approximately -0.51, we see negative correlations with SNOW (expected, as lower temperatures produce a higher chance of snow) and with our Subscriber/Customer Cluster 1 values. We found it interesting that the signs are consistent across the varying usertype clusters. Given that Subscriber Cluster 1 and Customer Cluster 1 are both positively signed at approximately 0.11, we may expect these clusters to react similarly to different weather conditions, mornings, and weekends. With negative correlation to temperature, mornings, and weekend days, one might speculate this is due to routine morning Subscriber/Customer Cluster 1 routes, in contrast to Subscriber Cluster 3 and 4 and Customer Cluster 2 and 3 excursions during good weather and/or weekends.

In [135]:
components = pd.Series(pca_class.components_[0], index=CitiBike_WithClusters.drop(["usertype", "Cluster_ID_Customer", "Cluster_ID_Subscriber"], axis = 1).columns)

maxcomponentsix = pd.Series(pd.DataFrame(abs(components).sort_values(ascending=False).head(15)).index)

matplotlib.rc('xtick', labelsize=8)

plt.rcParams['figure.figsize'] = (20, 8)
sns.set(font_scale=1.2)
weightsplot = pd.Series(components, index=maxcomponentsix)
weightsplot.plot(kind='bar', color = 'Tomato')
plt.show()

Principal Component 4 - Early Birds Vs Evening Rides

The second principal component we will discuss, PC4, likewise identifies 6 of the top 15 attributes as our newly added cluster attributes. PC4 appears to rely heavily on variance in ride start time. The matching negative signs for Subscriber Cluster 3 and Customer Clusters 2 and 3 indicate a rider preference for evening rides over mornings or nights. On the other side, the positively signed Customer Cluster 1 and Subscriber Clusters 4 and 5 all indicate a preference for morning and night rides over evening rides, regardless of rain. This overlap across cluster types could be impactful in identifying where customer behavior resembles that of a subscriber.

In [136]:
components = pd.Series(pca_class.components_[3], index=CitiBike_WithClusters.drop(["usertype", "Cluster_ID_Customer", "Cluster_ID_Subscriber"], axis = 1).columns)
maxcomponentsix = pd.Series(pd.DataFrame(abs(components).sort_values(ascending=False).head(15)).index)

matplotlib.rc('xtick', labelsize=8)

plt.rcParams['figure.figsize'] = (20, 8)
sns.set(font_scale=1.2)
weightsplot = pd.Series(components, index=maxcomponentsix)
weightsplot.plot(kind='bar', color = 'SlateBlue')
plt.show()

Deployment

How useful is the model for interested parties?

Classification of Usertype (Customer vs Subscriber)

How is this model useful? Citi Bike aims to increase rider subscriptions, but may fear that advertising to the wrong people could keep customers from enjoying its services. A classification model predicting usertype as Customer or Subscriber allows Citi Bike to identify those customers who meet the criteria of typical subscribing members. False positive predictions are extremely valuable for targeting this demographic.

How do the produced cluster features impact the model? As previously discussed, we added two features to our dataset through spectral clustering. These features identified clusters within both the subscriber and customer demographics. With the clusters identified, we were then able to build classification models for each in order to append the cluster values to the entire dataset. Once these features were added, running Principal Component Analysis identified several principal components that were heavily influenced by the new features. Our hope, although not evaluated in this report, is that the addition of these cluster features will positively impact the accuracy of a classification model predicting usertypes.
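As a toy illustration of the pipeline just described (the data and names below are synthetic stand-ins, not the report's actual objects): cluster a small sample, train a classifier on the cluster labels, then score the full dataset to append the cluster ID as a feature.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X_full = rng.randn(1000, 5)    # stand-in for the scaled feature matrix
X_sample = X_full[:200]        # spectral clustering only scales to a sample

# Step 1: cluster the small sample
labels = SpectralClustering(n_clusters=3, random_state=0).fit_predict(X_sample)

# Step 2: train a classifier to reproduce the cluster assignments
clf = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_sample, labels)

# Step 3: score the full dataset and append the cluster ID as a feature
cluster_id = clf.predict(X_full)
X_aug = np.column_stack([X_full, cluster_id])
```

In the report this third step is what produces the Cluster_ID_Customer and Cluster_ID_Subscriber columns for all 500,000 rows.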

How would this model be deployed? This model could be deployed for real-time predictions, or as a periodic corporate marketing alert if customer email addresses are readily available. Our preference would be a real-time prediction as a customer returns a bike. Upon return, if the model suggests a customer exhibits subscribing tendencies, email alerts, pop-up promotions, etc. may be deployed to draw the customer's attention to subscriber offerings. Alternatively, the marketing team could receive this information periodically and implement custom strategies based on industry best practices for this target group. This group of individuals is likely more apt to respond positively to these tactics, since their usage tendencies already align closely with those of subscribing members.

How often would the model need to be updated? This model would definitely need periodic updates. As the Citi Bike share service grows and new locations arrive, the model will need to be updated to account for the new sites. Also, as the population of NYC shifts over time (new businesses, schools, residential areas, etc.), trends may fluctuate. These fluctuations will need to be accounted for regularly to keep the model current. Our recommendation would be periodic (monthly or quarterly) model refits in Citi Bike systems to account for these possible changes.

Additional Data to Collect:
  • Event/Restaurant/Retail Data: Given that we have detailed geocoordinate data and have already demonstrated powerful use of the Google Maps API, it would be possible to incorporate location details surrounding Citi Bike start and stop locations. There is potential for such data to be gathered automatically using APIs such as Google's. Having this data would provide further insight into why some bike share locations are more popular than others. Such data could even help Citi Bike predict changes in station demand based on changing surroundings and help determine where new stations should be installed.
  • Special Events: Similar to the previous idea, merging other public data based on geophysical coordinates and timeline could introduce other external factors such as the effects of parades, public concerts, festivals, and other events on ride activity for a given day or week. This would help identify/predict abnormal activity in this and future data sets. Additionally, it would provide insight to Citi Bike as to how to better plan and prepare for such events to boost rental count and increase trip duration.
  • GPS Enabled Bike Computers: Though not influenced by the data we have at hand, adding bicycle tracking hardware to each Citi Bike rental would provide substantial value to future data sets. GPS tracking would enable Citi Bike to follow the specific routes taken by clients and could even aid NYC planners with transportation projects. Having route information means that true distance covered would be available, an attribute with far more meaning than our LinearDistance attribute. Incorporating GPS tracking with bike speed would also provide insight into individual rider activity. For example, just because a rider's trip duration was 6 hours doesn't mean they actively rode for that entire time; it is far more likely such a rider stopped for an extended period at least once during the trip. Adding GPS and speed data would close these gaps.

Deploying the Chosen Model on new data

We have discussed above what value our model holds, our preferred method of deployment, and the frequency of model updates. Below we walk through the process of prepping a real-time prediction model for deployment and actually executing model predictions on new data inputs.

Prepping the Model for Deployment

A key component in deploying our model is the reuse of the data transformation and/or model fit objects created during this process. We need to be able to run new data through the same constructs listed below, which were used in the testing and training process:

  • Standard Scaler
  • Model CLF Fit (RF decision trees)
  • Cluster Column List for Random Forest Prediction Input

To do this, we must take our Python objects currently stored in the active kernel and persist them to "pickled" (.pkl) files. A .pkl file is a serialized version of a Python object that can be loaded again later. To prove our pickled files are being used instead of previously created in-memory objects, we execute the %reset Jupyter magic to clear all environment objects before proceeding.

In [137]:
import pickle 
from datetime import timedelta

def pickleObject(objectname, filename, filepath = "PickleFiles/"):
    fullpicklepath = "{0}{1}.pkl".format(filepath, filename)
    # Create an variable to pickle and open it in write mode
    picklefile = open(fullpicklepath, 'wb')
    pickle.dump(objectname, picklefile)
    picklefile.close()
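For completeness, `unpickleObject`, used earlier to restore cached results, is the natural counterpart to `pickleObject`; a minimal sketch following the same path convention (the actual definition lives elsewhere in the notebook):

```python
import pickle

def unpickleObject(filename, filepath="PickleFiles/"):
    """Load and return a previously pickled object, mirroring the
    '{filepath}{filename}.pkl' convention used by pickleObject."""
    fullpicklepath = "{0}{1}.pkl".format(filepath, filename)
    with open(fullpicklepath, 'rb') as picklefile:
        return pickle.load(picklefile)
```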
In [138]:
SubClusterCols = CitiBike_S_Clust.drop(["usertype","Cluster_ID_Customer", "Cluster_ID_Subscriber"], axis=1).columns.values.tolist() 
CusClusterCols = CitiBike_C_Clust.drop(["usertype","Cluster_ID_Customer", "Cluster_ID_Subscriber"], axis=1).columns.values.tolist() 
In [139]:
objectlist   = [scaler,   rfc_clf_Subscriber,    rfc_clf_Customer,   SubClusterCols,   CusClusterCols   ]
filenamelist = ["scaler", "rfc_clf_Subscriber",  "rfc_clf_Customer", "SubClusterCols", "CusClusterCols" ]

for i in range(0,len(objectlist)):
    pickleObject(objectlist[i], filenamelist[i])
In [140]:
%reset
Once deleted, variables cannot be recovered. Proceed (y/[n])? y
In [141]:
import pickle 
from datetime import timedelta
from geopy.distance import vincenty
import holidays
from datetime import datetime
from dateutil.parser import parse
import glob
import pandas as pd
import numpy as np
from IPython.display import display
import pyowm
Predictions on New Data

Below is a single data entry for a Customer upon return of a bike rental. We have produced fake values based on the value ranges present in our original Citi Bike dataset. For the purpose of this "real-time" prediction, the current date/time of execution is used as the moment the user submits a bike return.

In [142]:
%%time 

NewData = pd.DataFrame({
                        "tripduration": [579],
                        "starttime": [datetime.now() + timedelta(seconds = -579)],
                        "stoptime": [datetime.now()],
                        "start_station_id": [477],
                        "start_station_name": ["W 41 St & 8 Ave"],
                        "start_station_latitude": [40.756405],
                        "start_station_longitude": [-73.990026],
                        "end_station_id": [441],
                        "end_station_name": ["E 52 St & 2 Ave"],
                        "end_station_latitude": [40.756014],
                        "end_station_longitude": [-73.967416],
                        "bikeid": [16537],
                        "usertype": ["Customer"],
                        "birthyear": ["\\N"],
                        "gender": [0]
                      })

display(NewData)
bikeid birthyear end_station_id end_station_latitude end_station_longitude end_station_name gender start_station_id start_station_latitude start_station_longitude start_station_name starttime stoptime tripduration usertype
0 16537 \N 441 40.756014 -73.967416 E 52 St & 2 Ave 0 477 40.756405 -73.990026 W 41 St & 8 Ave 2017-04-16 23:06:09.026501 2017-04-16 23:15:48.026501 579 Customer
Wall time: 91.7 ms

Data Transformations and Additional Features

As was done previously during Data Understanding and Preparation, there are several attributes that we computed and/or obtained from third parties to complete our dataset. The following attributes need to be created:

  • LinearDistance
  • DayOfWeek
  • TimeOfDay
  • HolidayFlag
  • TMIN
  • TAVE
  • TMAX
  • PRCP
  • SNOW

Since the current time is used for the bike return, we can query the OpenWeatherMap API for the current day's forecast. For this report, only the free features of this API are available. In a true production implementation, we recommend a subscription that unlocks additional features for more accurate results and raises the calls-per-minute limit imposed on the free tier.
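
As a hedge against the free tier's calls-per-minute cap, a production implementation might wrap each forecast request in a retry with exponential backoff. The sketch below is illustrative only; the `call_with_backoff` helper and its parameters are our own invention, not part of pyowm.

```python
import time

def call_with_backoff(fn, max_retries=3, base_delay=1.0):
    """Retry fn() with exponential backoff, e.g. when a rate limit is hit."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

In practice the pyowm forecast call would be passed in as `fn` via a lambda; any exception raised by a rate-limited request triggers the wait-and-retry loop.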

In [143]:
%%time 

    ## Compute LinearDistance from Start/End Lat,Long Coordinates
NewData["LinearDistance"] = NewData.apply(lambda x: vincenty((x["start_station_latitude"], x["start_station_longitude"]), 
                                                             (x["end_station_latitude"],   x["end_station_longitude"])).miles,
                                          axis = 1)

    
    ## Compute DayOfWeek from Start Time
        # starttime needs to be converted to a pandas datetime type before we can find the weekday name
NewData['starttime'] = pd.to_datetime(NewData['starttime'])
NewData["DayOfWeek"] = NewData['starttime'].dt.weekday_name

    ## Compute TimeOfDay from Start Time
        ##Morning       5AM-10AM
        ##Midday        10AM-2PM
        ##Afternoon     2PM-5PM
        ##Evening       5PM-10PM
        ##Night         10PM-5AM


NewData["TimeOfDay"] = np.where((NewData['starttime'].dt.hour >= 5) & (NewData['starttime'].dt.hour < 10), 'Morning',
                                np.where((NewData['starttime'].dt.hour >= 10) & (NewData['starttime'].dt.hour < 14), 'Midday',
                                         np.where((NewData['starttime'].dt.hour >= 14) & (NewData['starttime'].dt.hour < 17), 'Afternoon',
                                                  np.where((NewData['starttime'].dt.hour >= 17) & (NewData['starttime'].dt.hour < 22), 'Evening',
                                                           'Night' ### ELSE case represents Night
                                                          )
                                                 )
                                        )
                               )
                                                  
    ## Compute HolidayFlag from Start Time
NewData["HolidayFlag"] = NewData['starttime'].isin(holidays.UnitedStates())
NewData["HolidayFlag"] = np.where(NewData["HolidayFlag"] == False, 0, 1)
Wall time: 253 ms
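
As an aside, the nested np.where ladder used above for TimeOfDay can be written more compactly with pd.cut on the hour, passing ordered=False so the Night label can cover both ends of the day. A small illustrative sketch on toy hours:

```python
import pandas as pd

hours = pd.Series([3, 7, 12, 15, 19, 23])  # toy start-time hours

# Bins mirror the report's windows; right=False makes each interval [lo, hi),
# and ordered=False permits the duplicate 'Night' label for the wrap-around
time_of_day = pd.cut(hours,
                     bins=[0, 5, 10, 14, 17, 22, 24],
                     labels=['Night', 'Morning', 'Midday',
                             'Afternoon', 'Evening', 'Night'],
                     right=False, ordered=False)
```

Either form produces the same five categories; the pd.cut version is easier to extend if the time windows ever change.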
In [144]:
%%time 


owm = pyowm.OWM('462d2effa0ba127689b824b37efc9d12')  # You MUST provide a valid API key
Wall time: 1.19 s
In [145]:
%%time 

forecaster = owm.three_hours_forecast_at_coords(lat = float(NewData["start_station_latitude"]), lon = float(NewData["start_station_longitude"]))
forecast = forecaster.get_forecast()
fweather_list = forecast.get_weathers()
Wall time: 980 ms
In [162]:
%%time 

tlist = []
plist = []
slist = []


for x in fweather_list:
    date = datetime.date(datetime.strptime(x.get_reference_time('iso'),"%Y-%m-%d %H:%M:%S+00") + timedelta(hours = -5))
    
    if (date >= datetime.date(NewData["starttime"].min() + timedelta(days=1))) \
    and (date < datetime.date(NewData["starttime"].min())+ timedelta(days = 2)): 
        temp = x.get_temperature('fahrenheit')['temp']
        prcp = x.get_rain()   # e.g. {'3h': 0.5}, or {} when no rain forecast
        snow = x.get_snow()   # e.g. {'3h': 1.2}, or {} when no snow forecast
        tlist.append(temp)
        
        # get_rain()/get_snow() return dicts; extract the numeric amount so the
        # DataFrame sums below operate on numbers rather than raw dicts
        plist.append(prcp.get('3h', 0) if prcp else 0)
        slist.append(snow.get('3h', 0) if snow else 0)
        
tempdata = pd.DataFrame(tlist)
prcpdata = pd.DataFrame(plist)
snowdata = pd.DataFrame(slist)

NewData["TMIN"] = tempdata.min()
NewData["TAVE"] = tempdata.mean()
NewData["TMAX"] = tempdata.max()
NewData["PRCP"] = prcpdata.sum()
NewData["SNOW"] = snowdata.sum()
Wall time: 141 ms

With the additional features added, we have several missing values to scrub, the tripdurationLog attribute to compute, and data type conversions to apply to our data.

In [163]:
%%time 

# Replace '\N' Birth Years with Zero Values
NewData["birthyear"] = NewData["birthyear"].replace(r'\N', '0')

# Convert Columns to Numerical Values
NewData[['tripduration', 'birthyear', 'LinearDistance', 'PRCP', 'SNOW', 'TAVE', 'TMAX', 'TMIN']] \
    = NewData[['tripduration', 'birthyear', 'LinearDistance', 'PRCP', 'SNOW', 'TAVE', 'TMAX',
                            'TMIN']].apply(pd.to_numeric)

# Convert Columns to Date Values
NewData[['starttime', 'stoptime']] \
    = NewData[['starttime', 'stoptime']].apply(pd.to_datetime)

# Compute Age: 0 Birth Year = 0 Age ELSE Compute Start Time Year Minus Birth Year
NewData["Age"] = np.where(NewData["birthyear"] == 0, 0,
                                       NewData["starttime"].dt.year - NewData["birthyear"])

# Convert Columns to Str Values
NewData[['start_station_id', 'end_station_id', 'bikeid', 'HolidayFlag', 'gender']] \
    = NewData[['start_station_id', 'end_station_id', 'bikeid', 'HolidayFlag', 'gender']].astype(str)

# Log Transform Column Added
NewData["tripdurationLog"] = NewData["tripduration"].apply(np.log)
    
display(NewData)
bikeid birthyear end_station_id end_station_latitude end_station_longitude end_station_name gender start_station_id start_station_latitude start_station_longitude start_station_name starttime stoptime tripduration usertype LinearDistance DayOfWeek TimeOfDay HolidayFlag TMIN TAVE TMAX PRCP SNOW Age tripdurationLog DayOfWeek_Monday DayOfWeek_Tuesday DayOfWeek_Wednesday DayOfWeek_Thursday DayOfWeek_Friday DayOfWeek_Saturday DayOfWeek_Sunday TimeOfDay_Morning TimeOfDay_Midday TimeOfDay_Afternoon TimeOfDay_Evening TimeOfDay_Night
0 16537 0 441 40.756014 -73.967416 E 52 St & 2 Ave 0 477 40.756405 -73.990026 W 41 St & 8 Ave 2017-04-16 23:06:09.026501 2017-04-16 23:15:48.026501 579 Customer 1.186684 Sunday Night 0 58.51 63.39625 66.03 0 0 0 6.361302 0 0 0 0 0 0 1 0 0 0 0 1
Wall time: 44.1 ms

Encoding categorical attributes

As was performed during the data preparation section of this report, we have several categorical values which require encoding. Since we cannot apply the dummy-variable logic as before (a single record would only produce columns for the values it actually contains), we manually check each possible value, assigning 1 on a match and 0 otherwise. The following categorical attributes are encoded below:

  • DayOfWeek
  • TimeOfDay
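For reference, a more scalable alternative to the manual encoding below is pd.get_dummies followed by a reindex to the full, known column set, which guarantees every level gets a column even when the input is a single record. A hypothetical sketch using the DayOfWeek levels:

```python
import pandas as pd

day_levels = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
              'Friday', 'Saturday', 'Sunday']
record = pd.DataFrame({'DayOfWeek': ['Sunday']})  # a single new observation

# get_dummies alone would yield only DayOfWeek_Sunday for this record;
# reindexing to the fixed column list restores the missing levels as 0
dummies = pd.get_dummies(record['DayOfWeek'], prefix='DayOfWeek')
dummies = dummies.reindex(columns=['DayOfWeek_' + d for d in day_levels],
                          fill_value=0)
```

The same pattern would apply to the TimeOfDay levels, and it avoids repeating one np.where call per level.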
In [164]:
%%time 

## DayOfWeek
NewData["DayOfWeek_Monday"] = np.where(NewData["DayOfWeek"] == "Monday", 1, 0)
NewData["DayOfWeek_Tuesday"] = np.where(NewData["DayOfWeek"] == "Tuesday", 1, 0)
NewData["DayOfWeek_Wednesday"] = np.where(NewData["DayOfWeek"] == "Wednesday", 1, 0)
NewData["DayOfWeek_Thursday"] = np.where(NewData["DayOfWeek"] == "Thursday", 1, 0)
NewData["DayOfWeek_Friday"] = np.where(NewData["DayOfWeek"] == "Friday", 1, 0)
NewData["DayOfWeek_Saturday"] = np.where(NewData["DayOfWeek"] == "Saturday", 1, 0)
NewData["DayOfWeek_Sunday"] = np.where(NewData["DayOfWeek"] == "Sunday", 1, 0)

## TimeOfDay
NewData["TimeOfDay_Morning"] = np.where(NewData["TimeOfDay"] == "Morning", 1, 0)
NewData["TimeOfDay_Midday"] = np.where(NewData["TimeOfDay"] == "Midday", 1, 0)
NewData["TimeOfDay_Afternoon"] = np.where(NewData["TimeOfDay"] == "Afternoon", 1, 0)
NewData["TimeOfDay_Evening"] = np.where(NewData["TimeOfDay"] == "Evening", 1, 0)
NewData["TimeOfDay_Night"] = np.where(NewData["TimeOfDay"] == "Night", 1, 0)

display(NewData)
bikeid birthyear end_station_id end_station_latitude end_station_longitude end_station_name gender start_station_id start_station_latitude start_station_longitude start_station_name starttime stoptime tripduration usertype LinearDistance DayOfWeek TimeOfDay HolidayFlag TMIN TAVE TMAX PRCP SNOW Age tripdurationLog DayOfWeek_Monday DayOfWeek_Tuesday DayOfWeek_Wednesday DayOfWeek_Thursday DayOfWeek_Friday DayOfWeek_Saturday DayOfWeek_Sunday TimeOfDay_Morning TimeOfDay_Midday TimeOfDay_Afternoon TimeOfDay_Evening TimeOfDay_Night
0 16537 0 441 40.756014 -73.967416 E 52 St & 2 Ave 0 477 40.756405 -73.990026 W 41 St & 8 Ave 2017-04-16 23:06:09.026501 2017-04-16 23:15:48.026501 579 Customer 1.186684 Sunday Night 0 58.51 63.39625 66.03 0 0 0 6.361302 0 0 0 0 0 0 1 0 0 0 0 1
Wall time: 40.6 ms

The next step in deploying predictions on new data is to "unpickle" the .pkl file objects needed for our model. We need to unpickle the constructs below in order to reproduce our model and apply it to new data:

  • Standard Scaler
  • Model CLF Fit (RF decision trees)
  • Cluster Column List for Random Forest Prediction Input
In [165]:
%%time
def unpickleObject(filename, filepath = "PickleFiles/"):
    fullunpicklepath = "{0}{1}.pkl".format(filepath, filename)
    # Open the pickle file in read (binary) mode and load the stored object
    unpicklefile = open(fullunpicklepath, 'rb')
    obj = pickle.load(unpicklefile)
    unpicklefile.close()
    
    return obj
Wall time: 0 ns
In [166]:
%%time
unpickleObjectList = []
filenamelist = ["scaler","rfc_clf_Subscriber",  "rfc_clf_Customer", "SubClusterCols", "CusClusterCols" ]

for i in range(0,len(filenamelist)):
    unpickleObjectList.append(unpickleObject(filenamelist[i]))

scaler = unpickleObjectList[0]
rfc_clf_Subscriber = unpickleObjectList[1]
rfc_clf_Customer = unpickleObjectList[2]
SubClusterCols = unpickleObjectList[3]
CusClusterCols = unpickleObjectList[4]
Wall time: 7.52 ms

In order to put the data in the same standard-scaled form it was in during clustering, we must transform it exactly as it was transformed during training. To do this, we use the transform method of the unpickled scaler object, achieving the same result as if this record had been part of the original training dataset.
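
The key behavior, illustrated below on toy data, is that transform applies the means and variances learned at fit time, so a new record is scaled exactly as it would have been inside the training set:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])   # toy single-column training data
scaler = StandardScaler().fit(train)      # learns mean=2.0 from training data

new_record = np.array([[2.0]])            # new observation equal to the training mean
scaled = scaler.transform(new_record)     # scaled by training statistics, not refit
```

Calling fit_transform on the new record instead would recompute statistics from that one row and produce meaningless values, which is why only transform is used at prediction time.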

In [167]:
%%time
NewData_Scaled = pd.DataFrame(scaler.transform(NewData[SubClusterCols]), columns = SubClusterCols)
display(NewData_Scaled)
start_station_latitude start_station_longitude end_station_latitude end_station_longitude PRCP SNOW TAVE TMAX TMIN DayOfWeek_Friday DayOfWeek_Monday DayOfWeek_Saturday DayOfWeek_Sunday DayOfWeek_Thursday DayOfWeek_Tuesday DayOfWeek_Wednesday TimeOfDay_Afternoon TimeOfDay_Evening TimeOfDay_Midday TimeOfDay_Morning TimeOfDay_Night tripdurationLog
0 1.104717 0.109282 1.091983 1.845698 -0.305915 -0.069606 -0.114631 -0.377781 0.044255 -0.399214 -0.385424 -0.472244 2.267759 -0.381894 -0.385824 -0.388611 -0.541936 -0.694351 -0.544202 -0.407459 3.466716 -0.431971
Wall time: 34.6 ms

With scaled values transformed, we can create columns for "Cluster_ID_Subscriber" and "Cluster_ID_Customer" using the unpickled clf objects from the random forest classification models produced earlier.

In [168]:
%%time
X = NewData_Scaled[SubClusterCols]

NewData_Scaled["Cluster_ID_Subscriber"] = rfc_clf_Subscriber.predict(X)
Wall time: 108 ms
In [169]:
%%time
X = NewData_Scaled[CusClusterCols]

NewData_Scaled["Cluster_ID_Customer"] = rfc_clf_Customer.predict(X)
Wall time: 107 ms
In [170]:
%%time
display(NewData_Scaled.transpose())
0
start_station_latitude 1.104717
start_station_longitude 0.109282
end_station_latitude 1.091983
end_station_longitude 1.845698
PRCP -0.305915
SNOW -0.069606
TAVE -0.114631
TMAX -0.377781
TMIN 0.044255
DayOfWeek_Friday -0.399214
DayOfWeek_Monday -0.385424
DayOfWeek_Saturday -0.472244
DayOfWeek_Sunday 2.267759
DayOfWeek_Thursday -0.381894
DayOfWeek_Tuesday -0.385824
DayOfWeek_Wednesday -0.388611
TimeOfDay_Afternoon -0.541936
TimeOfDay_Evening -0.694351
TimeOfDay_Midday -0.544202
TimeOfDay_Morning -0.407459
TimeOfDay_Night 3.466716
tripdurationLog -0.431971
Cluster_ID_Subscriber 3.000000
Cluster_ID_Customer 1.000000
Wall time: 8.02 ms

With our clusters added to the dataset, the bike return data for this user is now ready to be utilized in a classification / regression model of choice, with both Subscriber and Customer cluster assignments included. As discussed previously, given the purpose of this report and its focus on clustering, that model and its data preparation steps have not been prepared here.

Note that the transformation and feature-addition steps on the original input complete within a few seconds of execution. This does not include the classification / regression model itself, but it is a realistic time frame up to this point for a real-time application.
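
If this pipeline were deployed, the latency claim could be asserted in code rather than read off the %%time output. A minimal sketch using time.perf_counter around a hypothetical transform step (the transform_step function here is a stand-in, not part of the report's pipeline):

```python
import time

def transform_step(x):
    return x * 2  # stand-in for a real feature transformation

start = time.perf_counter()
result = transform_step(21)
elapsed = time.perf_counter() - start  # seconds, as a float
```

A production health check could log `elapsed` per request and alert when it exceeds the latency budget.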

Exceptional Work

The following items were especially challenging during this project and, we felt, noteworthy to mention:

  • High Dimensionality: Most documentation found for clustering is based on 2D or 3D examples. We wanted to produce clusters that attempted to describe the dataset in full. This produced several challenges in run times (single iterations for fitting clusters took 30 minutes to 3 hours each) and in visualizing clusters in a way that made sense for contextual interpretation.
  • Utilization of R in Jupyter: Only 1 of our 3 team members was able to correctly run rpy2 on their machine for this lab. This produced challenges as we tried to collaborate on work and use objects created from R code downstream in the project.
  • Prediction of clusters: The rubric for this lab seemed to stop sooner than we thought relevant for a real-world application. We found that most clustering algorithms do not have a "predict"-type function. Once clusters were fit, you had the cluster assignments for those observations, but no way to assign a cluster to new observations. Given that our spectral clustering was so limited in sample size, the majority of our dataset was left unclustered, which would not be very useful in application. To mitigate this challenge, we built a random forest classification model trained (10-fold) on the fitted clusters, yielding a model ready to predict the missing cluster values efficiently for both Subscriber and Customer observations.
  • Identifying the "Value" in our Clusters through PCA: Once clusters were added to the dataset, we had no assurance that they would actually add value for Citi Bike. Without the time or the project focus to build a full classification model predicting usertype, we still wanted to validate our clusters in some way and uncover further insights in the process. We accomplished this through PCA. After identifying the appropriate principal components within our dataset, we searched the components for our newly added cluster columns. Several principal components included our cluster columns, and two in particular had 6 of the 8 cluster columns among the top 15 attributes by loading. This gave us confidence that these cluster columns could well add value in a downstream classification model.
  • Deployment: In the last lab, we identified our recommendation for the value, deployment methodology, and update frequency of a classification model on usertype. The goals for this model remained the same during this lab, with the addition of our predicted clusters appended to the dataset and the value those new attributes add. Although the deployment methodology was already identified, our team wanted to go one step further and produce code that takes our implemented feature additions, transformations, and cluster predictions and applies them to a new data observation. Per our recommended real-time deployment methodology, this code produces all data needed (including cluster predictions) for a new observation. This was done using .pkl files after a Jupyter environment reset, to approximate the "clean" environment you would see in a Citi Bike return terminal and/or application. With these tasks complete, in addition to the classification work produced in the last lab, we feel this deployment code is close to ready for a production system.
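
The cluster-prediction workaround described above (fitting clusters, then training a classifier on the resulting cluster labels so that new observations can be assigned a cluster) can be sketched on toy data; KMeans stands in here for the spectral clustering used in the report, and the two-blob data is purely illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

# Two well-separated toy blobs standing in for the clustered sample
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(3, 0.3, (50, 2))])

# Step 1: fit clusters on the sample (the report used spectral clustering)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Step 2: train a random forest to predict the cluster ID from the features
clf = RandomForestClassifier(random_state=0).fit(X, labels)

# Step 3: assign a cluster to an unseen observation near the second blob
new_obs = np.array([[3.1, 2.9]])
predicted_cluster = clf.predict(new_obs)[0]
```

This is the same two-stage pattern used in the report's rfc_clf_Subscriber and rfc_clf_Customer models: the clustering step never sees new data, and the classifier carries cluster assignment into deployment.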